GitHub - adrieljss/gitdb: Key-value database built in C++23 with complete version tracking (zero-cost branching) and query by timestamp while efficiently handling memory.

A version-controlled key-value database in C++23

Branch, merge, diff, and time-travel your data — built on persistent copy-on-write B-trees with Git-style commits.

Getting Started · Try the REPL · Use as a Library · How It Works

Why GitDB?

Traditional KV stores are destructive, a put overwrites, a delete erases.
GitDB allows you to time travel through your changes without any actions from your part.
Thus, making it inherently non-destructive.

Internally, GitDB uses a modified BTree which will be explained on the How It Works section.
What this means is that when branching and saving a snapshot (commit) of the database, it will cost nothing (as opposed to a traditional database which you will need to copy the whole DB).

Use Cases vs. Traditional Database

Need to test out 100 versions of a database? - just branch it, it will cost nothing (as opposed to a traditional database, you will need to copy the db 100 times).
Non-lossy rollbacks - rollback at any timestamp, no snapshot interval needed.
Query with time - GitDB's structure allows you to query a data's state at a specific timestamp. (commit checkout first and then do get(key) or just do get(key, timestamp)).
and much more!

Getting Started

Requirements: CMake 3.16+ and a C++23 compiler (GCC, Clang, or MSVC).

No external dependencies, SHA-256 is bundled.

# configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# build
cmake --build build --config Release -- -j 4

This produces three binaries:

Binary	Description
`build/gitdb_repl`	Interactive REPL for exploring GitDB
`build/gitdb_tests`	Test suite (GoogleTest)
`build/gitdb_basic_usage`	Example program (requires `-DGITDB_BUILD_EXAMPLES=ON`)

REPL

The fastest way to try GitDB. Run it and start writing data:

./build/gitdb_repl

By default it opens a repository at .gitdb in the current directory. Use repo init <path> to create one elsewhere.

Commit early, commit often. Commits are cheap, they're what enable time travel, branching, and diffing. Commit every time you make a change.

Example session

repo init ./mydb
put "user:1" "alice"
put "user:2" "bob"
commit create "add users"

scan "user:"                     # prefix scan
get "user:1"                     # → alice

put "user:1" "newalice"
commit create "edit alice"

get "user:1" 2025-06-15T10:00:00Z   # time travel query

branch create feature <HEAD_HEX>
branch checkout feature
put "user:3" "charlie"
commit create "add charlie"

diff refs main feature           # what changed between branches
merge theirs feature "merge"     # merge feature into main

commit log                       # view full history
match "^user:\\d+$"              # regex search across all keys
repo gc                          # reclaim unreferenced objects

Tips

Timestamps must be ISO-8601 UTC: YYYY-MM-DDTHH:MM:SSZ
Strings can be quoted with "..." and support escapes (\n, \t, \", \\)
scan <prefix> is fast (seeks into the B-tree). match <regex> traverses all nodes, prefer scan when a prefix query is enough.

Library Usage

Link against libgitdb and include the single header. Most of the time you only need the Repository class.

#include "gitdb/gitdb.hpp"
using namespace gitdb;

int main() {
    // open or create a repository
    Repository repo("./.gitdb");

    // write data and commit
    repo.put("user:1", "alice");
    repo.put("user:2", "bob");
    auto h = repo.commit("add users").value();

    // read
    auto val = repo.get("user:1");
    // val → optional<Bytes> containing "alice"

    // branch and modify
    repo.createBranch("feature");
    repo.checkout("feature");
    repo.put("user:3", "charlie");
    repo.commit("add charlie");

    // merge back
    repo.checkout("main");
    repo.setConflictResolver([](const Conflict& c) {
        return std::optional<Bytes>(c.ours);
    });
    repo.merge("feature", "merge feature");

    // time travel, read data at a previous commit
    auto snap = repo.at(h).value();
    auto old_val = snap.get("user:1");  // "alice" at that point in time

    // diff between any two refs
    auto changes = repo.diff("main~2", "main");
}

API Overview

KV Operations

Method	Description
`get(key)`	Retrieve value from current branch head
`get(key, timestamp)`	Retrieve value at a specific point in time
`put(key, value)`	Insert or update (staged until commit)
`del(key)`	Delete a key (staged until commit)
`scan(begin, end)`	Range scan over sorted keys
`contains(key)`	Check if a key exists

Version Control

Method	Description
`commit(message)`	Snapshot the working tree
`log(max_count)`	Commit history (newest first)
`at(commit_hash)`	Read-only snapshot at a commit
`diff(from, to)`	Differences between two commits or refs

Branching & Merging

Method	Description
`createBranch(name)`	Create a branch at HEAD
`checkout(name)`	Switch branches
`merge(branch, message)`	Three-way merge into current branch
`setConflictResolver(fn)`	Register a conflict handler
`tag(name)`	Tag the current HEAD commit

How It Works (The fun stuff)

Persistent copy-on-write B-trees

Every put creates new nodes along the path from leaf to root, producing a new root hash. Old nodes are never modified.

A branch is just a pointer to a root hash. A commit is a root hash + parent pointer + metadata. Unchanged subtrees are shared automatically through content-addressing.

Content-addressable object store

Every node, blob, and commit is stored as SHA-256(bytes) → bytes. This gives you deduplication (shared subtrees stored once), integrity checking (rehash to verify), and immutability (changing a byte changes the hash).

Atomicity

GitDB ensures that a crash at any point can't corrupt your data. This works through two mechanisms:

Atomic file writes. Every write to disk goes through a write-then-rename pattern: data is first written to a temporary file (with a random suffix), then atomically renamed to the final path. On POSIX systems, rename() is guaranteed to be atomic, the file either fully appears at the new path or doesn't. This means a crash mid-write can never produce a partially-written object or ref.

Write ordering. During a commit(), objects (new B-tree nodes) are written to the store first, and the branch ref is updated last. If the process crashes after writing objects but before updating the ref, the old commit is still valid, the branch still points to its previous commit, and the orphaned objects are just harmless wasted space. The ref update is the single atomic "switch" that makes the new commit visible.

Concurrency

GitDB uses a single-writer / multi-reader locking model with std::shared_mutex at two levels, plus an I/O staging layer optimized for commits:

Repository level. Reads (get, contains, scan) acquire a shared lock, any number of readers can run concurrently. Writes (put, del, commit) acquire an exclusive lock, blocking all other access for the duration of the operation.

Storage level. The disk backend has its own std::shared_mutex to make file I/O thread-safe. Object reads take a shared lock; object writes and ref updates take an exclusive lock. Mutations first go to an in‑memory staging area; on commit() only the objects reachable from the new B-tree root are persisted (“reachable‑only flush”).

Because nodes are immutable, a reader that grabbed a root hash before a write started can safely keep reading from the old tree, the writer doesn't invalidate anything, it only creates new nodes alongside the existing ones.

Maintenance: Garbage Collection (GC)

What it does: walks all live roots (branch heads and tags), traverses commits and B‑tree nodes/blobs to compute the reachable set, and deletes unreferenced objects from the object store.
When to run: after heavy write/benchmark sessions to reclaim intermediate objects not referenced by any commit.
Roots: branch heads and tags. If you want to keep a specific commit that’s not on any branch, create a tag for it before running GC.
REPL command:
```
repo gc
```
Prints the number of objects and bytes reclaimed.

Staging & Reachable‑Only Flush

Puts and deletes are staged in memory; reads see staged data immediately.
On commit() GitDB traverses the new root and persists only the reachable nodes and value blobs, then atomically updates the branch ref. This reduces on‑disk churn and keeps sizes close to the snapshot’s true footprint.

On-disk layout

.gitdb/
├── objects/          # content-addressed nodes and blobs
│   ├── a4/
│   │   └── a4f3c7b...
│   └── b2/
│       └── b21e84a...
├── refs/
│   ├── heads/        # branches → commit hashes
│   │   └── main
│   └── tags/
│       └── v1.0
└── HEAD              # current branch

Complexity

Operation	Time
`get(key)`	O(log n)
`get(key, timestamp)`	O(log c + log n)
`put(key, value)`	O(B · log n)
`delete(key)`	O(B · log n)
`scan(begin, end)`	O(log n + k)
`match(regex)`	O(n) + regex cost
`gc()`	O(L + A)
`diff(c1, c2)`	O(m)

n = total keys, c = commits, B = tree order, k = results in range, m = changed nodes. L = reachable (live) objects, A = all stored objects.

Tests

./build/gitdb_tests

Project Structure

gitdb/
├── include/gitdb/
│   ├── gitdb.hpp              # single-include header
│   ├── types.hpp              # Hash, Bytes, Error, Timestamp, Conflict
│   ├── node.hpp               # B-tree node
│   ├── btree.hpp              # persistent CoW B-tree engine
│   ├── iterator.hpp           # range scan iterator
│   ├── commit.hpp             # immutable commit object
│   ├── branch.hpp             # branch manager
│   ├── diff.hpp               # structural diff engine
│   ├── merge.hpp              # three-way merge
│   ├── repository.hpp         # public API facade
│   ├── garbage_collector.hpp  # GC API (reachability sweep)
│   ├── detail/sha256.hpp      # bundled SHA-256
│   ├── storage/
│   │   ├── backend.hpp        # abstract storage interface
│   │   ├── disk.hpp           # filesystem backend
│   │   └── memory.hpp         # in-memory backend (testing)
│   └── repl/                  # REPL command handlers
├── src/                       # implementations
├── benchmarks/
│   └── bench_compare.py       # Python benchmarks (plots and comparisons)
├── tests/                     # GoogleTest suite
├── examples/
│   └── basic_usage.cpp
└── CMakeLists.txt

REPL Command Reference

Area	Command	Usage	Description
KV	put	`put <key> <value>`	Insert or update a key in the working tree
KV	get	`get <key> [iso8601]`	Get a value at HEAD, or at a specific timestamp
KV	del	`del <key>`	Delete a key from the working tree
KV	scan	`scan <begin> [end]`	Prefix or range scan (inclusive end)
KV	match	`match <regex>`	Regex match on keys (traverses all nodes)
KV	root	`root`	Print the current HEAD commit hash
Commit	create	`commit create <msg>`	Create a commit and advance HEAD
Commit	show	`commit show <hex>`	Display commit metadata
Commit	checkout	`commit checkout <hex>`	Set working tree to a commit (detached)
Commit	log	`commit log`	List commit history on current branch
Commit	head	`commit head`	Show HEAD commit summary
Branch	create	`branch create <n> <hex>`	Create a branch at the given commit
Branch	checkout	`branch checkout <n>`	Switch to a branch
Branch	list	`branch list`	List all branches (`*` marks current)
Branch	head	`branch head`	Print current branch and its head commit
Branch	cc	`branch cc <n> <hex>`	Create and checkout in one step
Repo	init	`repo init <path>`	Create a new repository
Repo	open	`repo open <path>`	Open an existing repository
Repo	delete	`repo delete <path>`	Delete a repository directory
Repo	gc	`repo gc`	Sweep unreferenced objects (garbage collector)
Diff	diff	`diff <from-hex> <to-hex>`	Diff between two commits
Diff	refs	`diff refs <from> <to>`	Diff using branch/tag names (`~N` supported)
Merge	merge	`merge <branch> <msg>`	Merge branch into HEAD
Merge	ours	`merge ours <branch> <msg>`	Merge, preferring ours on conflicts
Merge	theirs	`merge theirs <branch> <msg>`	Merge, preferring theirs on conflicts

Benchmarks

The Python harness in benchmarks/ measures put/get throughput, scan/match latencies, branch costs, and size growth. It can compare GitDB with PostgreSQL, SQLite, and Redis.

Dependencies:
- Required: Python 3.8+, sqlite3 (stdlib)
- Optional: python-dotenv (auto-loads benchmarks/.env), psycopg2-binary (PostgreSQL), redis (redis-py), matplotlib (plots)

Configure services in benchmarks/.env (examples):

POSTGRES_DSN=host=127.0.0.1 dbname=bench user=postgres password=secret
REDIS_URL=redis://127.0.0.1:6379/15
GITDB_BENCH_N=10000
GITDB_BENCH_VAL_SIZE=32
BENCH_PERSIST=0

Run:
```
py benchmarks/bench_compare.py
```
Output:
- Prints per‑DB times for put/get/scan/match and branch (single).
- Runs branch ×200 per DB to show branching scalability and prints “size (after x200) …”.
- Saves plots under benchmarks/tmp/:
  - bench_branch_size.png — branch latency (1 vs ×200) and size before/after ×200
  - bench_throughput_latency.png — put/get throughput and scan/match latencies

Notes:

The harness clears state each run:
- GitDB: fresh repo under benchmarks/tmp unless BENCH_PERSIST=1
- PostgreSQL: drops all kv% tables
- SQLite: removes the db file
- Redis: deletes keys in both the base and branch prefixes

License

MIT, see LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.vscode		.vscode
benchmarks		benchmarks
examples		examples
include/gitdb		include/gitdb
misc		misc
src		src
tests		tests
.clangd		.clangd
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Why GitDB?

Use Cases vs. Traditional Database

Getting Started

REPL

Example session

Tips

Library Usage

API Overview

How It Works (The fun stuff)

Persistent copy-on-write B-trees

Content-addressable object store

Atomicity

Concurrency

Maintenance: Garbage Collection (GC)

Staging & Reachable‑Only Flush

On-disk layout

Complexity

Tests

Project Structure

REPL Command Reference

Benchmarks

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Why GitDB?

Use Cases vs. Traditional Database

Getting Started

REPL

Example session

Tips

Library Usage

API Overview

How It Works (The fun stuff)

Persistent copy-on-write B-trees

Content-addressable object store

Atomicity

Concurrency

Maintenance: Garbage Collection (GC)

Staging & Reachable‑Only Flush

On-disk layout

Complexity

Tests

Project Structure

REPL Command Reference

Benchmarks

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages