Building a Distributed Key-Value Store from Scratch (Part 1)

2025-05-26 • 3 min read

A deep dive into building a distributed key-value store from the ground up.

I’ve always been fascinated by how databases work under the hood – how they store data, ensure durability, and scale for massive workloads. To deepen my understanding and sharpen my systems programming skills, I set out to build a distributed key-value store from scratch. This series will document the process, from in-memory storage to persistence and distributed architecture.

This first post introduces the goals, roadmap, and progress so far.

For anyone interested, the full code is available on GitHub: https://github.com/TravisBubb/cpp-kv.


Project Goals

I have several goals for this endeavor, but here are the big ones:

  1. Understand the basics of storage engines.
    1. Write-ahead logging (WAL)
    2. Serialization & Deserialization
    3. Indexing & Compaction
  2. Explore distributed storage challenges.
    1. Replication & Sharding
    2. Consensus algorithms
    3. Gossip protocols
  3. Learn by building – not just reading articles or watching videos.
  4. Document the process for others (and for future me).

Roadmap and Milestones

Phase 1 – In-Memory Store (MVP)

  • Implement basic HashMap-style interface
  • Support for Set, Get, and Remove operations
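
To make this concrete, here is a minimal sketch of what a Phase 1 in-memory store could look like. The names are illustrative, not the actual cpp-kv API:

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative Phase 1 store: a thin wrapper around a hash map.
class InMemoryStore {
public:
  // Insert or overwrite the value for a key.
  void Set(const std::string& key, const std::string& value) {
    map_[key] = value;
  }

  // Return the value if present, std::nullopt otherwise.
  std::optional<std::string> Get(const std::string& key) const {
    auto it = map_.find(key);
    if (it == map_.end()) return std::nullopt;
    return it->second;
  }

  // Erase the key; returns true if something was removed.
  bool Remove(const std::string& key) { return map_.erase(key) > 0; }

private:
  std::unordered_map<std::string, std::string> map_;
};
```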

Phase 2 – Write-Ahead Log (WAL)

  • Serialize Set and Remove operations to disk
  • Append entries to a binary log file
  • Create interface to replay entries using a callback
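
As a rough illustration, one possible binary entry layout and append routine might look like the following. This is a hypothetical encoding, not necessarily the one cpp-kv uses:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical WAL entry layout: [op:1][key_len:4][key][val_len:4][value].
// Fields are written in native endianness; a real format would pin this down.
enum class OpType : uint8_t { kSet = 0, kRemove = 1 };

void AppendEntry(std::ofstream& log, OpType op, const std::string& key,
                 const std::string& value = "") {
  uint8_t op_byte = static_cast<uint8_t>(op);
  uint32_t key_len = static_cast<uint32_t>(key.size());
  uint32_t val_len = static_cast<uint32_t>(value.size());

  log.write(reinterpret_cast<const char*>(&op_byte), sizeof(op_byte));
  log.write(reinterpret_cast<const char*>(&key_len), sizeof(key_len));
  log.write(key.data(), key_len);
  log.write(reinterpret_cast<const char*>(&val_len), sizeof(val_len));
  log.write(value.data(), val_len);
  log.flush();  // make the entry durable before applying it in memory
}
```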

Phase 3 – Distributed Foundation

  • Storage node with a gRPC interface for performing Set and Remove operations
  • Coordinator node with a simple API to route requests to storage nodes based on a sharding key
  • Basic cluster discovery/configuration mechanism
  • Leader-follower data & log replication
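
For intuition, the simplest routing scheme a coordinator could start with is hash-mod-N over the sharding key. This sketch is illustrative only; Phase 5's consistent hashing would replace it so that adding or removing nodes does not remap every key:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hash the sharding key and pick one of node_count storage nodes.
std::size_t PickNode(const std::string& key, std::size_t node_count) {
  return std::hash<std::string>{}(key) % node_count;
}
```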

Phase 4 – Persistence

  • Memtable + SSTable-based persistence layer
  • WAL compaction
  • SSTable compaction

Phase 5 – Advanced Distributed Features

  • Consistent hashing
  • Explore Raft/Paxos consensus
  • Cluster bootstrapping & dynamic discovery
  • Cluster metrics & monitoring

Progress So Far

I’ve completed the in-memory data store and added a basic write-ahead log (WAL) system.

  • Before applying an operation to memory, I serialize and write it to disk.
  • Each WAL entry is a binary-encoded operation (Set or Remove) with a key and optional value.
  • Entries are replayable via a streaming callback interface.
    • This allows easy reconstruction of the in-memory map from the log on startup.
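
Here is a rough sketch of what that replay path can look like, with hypothetical names standing in for the real cpp-kv types:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical decoded WAL entry; names are illustrative, not the cpp-kv API.
struct WalEntry {
  bool is_set;                       // true for Set, false for Remove
  std::string key;
  std::optional<std::string> value;  // present only for Set entries
};

using ReplayCallback = std::function<void(const WalEntry&)>;

// Startup recovery sketch: `replay` is assumed to decode the log file and
// invoke the callback once per entry, in write order.
void Rebuild(std::unordered_map<std::string, std::string>& map,
             const std::function<void(ReplayCallback)>& replay) {
  replay([&map](const WalEntry& e) {
    if (e.is_set && e.value) {
      map[e.key] = *e.value;
    } else {
      map.erase(e.key);
    }
  });
}
```

Because each entry is self-describing, recovery is just re-applying every logged operation in order; no separate snapshot is needed at this stage.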

What’s Next?

Next up:

  • Implementing WAL recovery on startup.
  • Handling file truncation and corruption.
  • Exploring log compaction to reduce disk usage.

Stay Tuned

This project is ongoing, and I’ll continue posting as I make progress. If you’re interested in databases, systems programming, or building backend infrastructure, I hope you’ll follow along!

👉 GitHub Repo: https://github.com/TravisBubb/cpp-kv
👉 [More posts coming soon!]