Building a Distributed Key-Value Store from Scratch (Part 1)

2025-05-26 • 3 min read

A deep dive into building a distributed key-value store from the ground up.

I’ve always been fascinated by how databases work under the hood – how they store data, ensure durability, and scale for massive workloads. To deepen my understanding and sharpen my systems programming skills, I set out to build a distributed key-value store from scratch. This series will document the process, from in-memory storage to persistence and distributed architecture.

This first post introduces the goals, roadmap, and progress so far.

For anyone interested, the full code is available on GitHub: https://github.com/TravisBubb/cpp-kv.


Project Goals

I have several goals for this endeavor, but here are the big ones:

  1. Understand the basics of storage engines.
    1. Write-ahead logging (WAL)
    2. Serialization & Deserialization
    3. Indexing & Compaction
  2. Explore distributed storage challenges.
    1. Replication & Sharding
    2. Consensus algorithms
    3. Gossip protocols
  3. Learn by building – not just reading articles or watching videos.
  4. Document the process for others (and for future me).

Roadmap and Milestones

Phase 1 – In-Memory Store (MVP)

  • Implement basic HashMap-style interface
  • Support for Set, Get, and Remove operations
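
To make this concrete, here is a minimal sketch of what a Phase 1 in-memory store could look like. The names are illustrative, not the actual cpp-kv API:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative Phase 1 store: a thin wrapper around a hash map.
class InMemoryStore {
public:
  // Insert or overwrite the value for a key.
  void Set(const std::string& key, const std::string& value) {
    map_[key] = value;
  }

  // Return the value if present, std::nullopt otherwise.
  std::optional<std::string> Get(const std::string& key) const {
    auto it = map_.find(key);
    if (it == map_.end()) return std::nullopt;
    return it->second;
  }

  // Erase the key; returns true if something was removed.
  bool Remove(const std::string& key) { return map_.erase(key) > 0; }

private:
  std::unordered_map<std::string, std::string> map_;
};
```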

Phase 2 – Write-Ahead Log (WAL)

  • Serialize Set and Remove operations to disk
  • Append entries to a binary log file
  • Create interface to replay entries using a callback
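
As a rough illustration, one possible binary entry layout and append routine might look like the following. This is a hypothetical encoding, not necessarily the one cpp-kv uses:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical WAL entry layout: [op:1][key_len:4][key][val_len:4][value].
// Fields are written in native endianness; a real format would pin this down.
enum class OpType : uint8_t { kSet = 0, kRemove = 1 };

void AppendEntry(std::ofstream& log, OpType op, const std::string& key,
                 const std::string& value = "") {
  uint8_t op_byte = static_cast<uint8_t>(op);
  uint32_t key_len = static_cast<uint32_t>(key.size());
  uint32_t val_len = static_cast<uint32_t>(value.size());

  log.write(reinterpret_cast<const char*>(&op_byte), sizeof(op_byte));
  log.write(reinterpret_cast<const char*>(&key_len), sizeof(key_len));
  log.write(key.data(), key_len);
  log.write(reinterpret_cast<const char*>(&val_len), sizeof(val_len));
  log.write(value.data(), val_len);
  log.flush();  // make the entry durable before applying it in memory
}
```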

Phase 3 – Distributed Foundation

  • Storage node with a gRPC interface for performing Set and Remove operations
  • Coordinator node with a simple API to route requests to storage nodes based on a sharding key
  • Basic cluster discovery/configuration mechanism
  • Leader-follower data & log replication
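
For intuition, the simplest routing scheme a coordinator could start with is hash-mod-N over the sharding key. This sketch is illustrative only; Phase 5's consistent hashing would replace it so that adding or removing nodes does not remap every key:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hash the sharding key and pick one of node_count storage nodes.
std::size_t PickNode(const std::string& key, std::size_t node_count) {
  return std::hash<std::string>{}(key) % node_count;
}
```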

Phase 4 – Persistence

  • Memtable + SSTable-based persistence layer
  • WAL compaction
  • SSTable compaction

Phase 5 – Advanced Distributed Features

  • Consistent hashing
  • Explore Raft/Paxos consensus
  • Cluster bootstrapping & dynamic discovery
  • Cluster metrics & monitoring

Progress So Far

I’ve completed the in-memory data store and added a basic write-ahead log (WAL) system.

  • Before applying an operation to memory, I serialize and write it to disk.
  • Each WAL entry is a binary-encoded operation (Set or Remove) with a key and optional value.
  • Entries are replayable via a streaming callback interface.
    • This allows easy reconstruction of the in-memory map from the log on startup.
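
Here is a rough sketch of what that replay path can look like, with hypothetical names standing in for the real cpp-kv types:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical decoded WAL entry; names are illustrative, not the cpp-kv API.
struct WalEntry {
  bool is_set;                       // true for Set, false for Remove
  std::string key;
  std::optional<std::string> value;  // present only for Set entries
};

using ReplayCallback = std::function<void(const WalEntry&)>;

// Startup recovery sketch: `replay` is assumed to decode the log file and
// invoke the callback once per entry, in write order.
void Rebuild(std::unordered_map<std::string, std::string>& map,
             const std::function<void(ReplayCallback)>& replay) {
  replay([&map](const WalEntry& e) {
    if (e.is_set && e.value) {
      map[e.key] = *e.value;
    } else {
      map.erase(e.key);
    }
  });
}
```

Because each entry is self-describing, recovery is just re-applying every logged operation in order; no separate snapshot is needed at this stage.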

What’s Next?

Next up:

  • Implementing WAL recovery on startup.
  • Handling file truncation and corruption.
  • Exploring log compaction to reduce disk usage.

Stay Tuned

This project is ongoing, and I’ll continue posting as I make progress. If you’re interested in databases, systems programming, or building backend infrastructure, I hope you’ll follow along!

👉 GitHub Repo: https://github.com/TravisBubb/cpp-kv
👉 [More posts coming soon!]