Skip to content

adrieljss/gitdb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitDB Logo

A version-controlled key-value database in C++23

Branch, merge, diff, and time-travel your data — built on persistent copy-on-write B-trees with Git-style commits.

Getting Started · Try the REPL · Use as a Library · How It Works


Branch and Size Benchmarks

Throughput and Latency Benchmarks

Why GitDB?

Traditional KV stores are destructive, a put overwrites, a delete erases.
GitDB allows you to time travel through your changes without any actions from your part.
Thus, making it inherently non-destructive.

Internally, GitDB uses a modified BTree which will be explained on the How It Works section.
What this means is that when branching and saving a snapshot (commit) of the database, it will cost nothing (as opposed to a traditional database which you will need to copy the whole DB).

Use Cases vs. Traditional Database

  • Need to test out 100 versions of a database? - just branch it, it will cost nothing (as opposed to a traditional database, you will need to copy the db 100 times).
  • Non-lossy rollbacks - rollback at any timestamp, no snapshot interval needed.
  • Query with time - GitDB's structure allows you to query a data's state at a specific timestamp. (commit checkout first and then do get(key) or just do get(key, timestamp)).
  • and much more!

Getting Started

Requirements: CMake 3.16+ and a C++23 compiler (GCC, Clang, or MSVC).

No external dependencies, SHA-256 is bundled.

# configure
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release

# build
cmake --build build --config Release -- -j 4

This produces three binaries:

Binary Description
build/gitdb_repl Interactive REPL for exploring GitDB
build/gitdb_tests Test suite (GoogleTest)
build/gitdb_basic_usage Example program (requires -DGITDB_BUILD_EXAMPLES=ON)

REPL

The fastest way to try GitDB. Run it and start writing data:

./build/gitdb_repl

By default it opens a repository at .gitdb in the current directory. Use repo init <path> to create one elsewhere.

Commit early, commit often. Commits are cheap, they're what enable time travel, branching, and diffing. Commit every time you make a change.

Example session

repo init ./mydb
put "user:1" "alice"
put "user:2" "bob"
commit create "add users"

scan "user:"                     # prefix scan
get "user:1"                     # → alice

put "user:1" "newalice"
commit create "edit alice"

get "user:1" 2025-06-15T10:00:00Z   # time travel query

branch create feature <HEAD_HEX>
branch checkout feature
put "user:3" "charlie"
commit create "add charlie"

diff refs main feature           # what changed between branches
merge theirs feature "merge"     # merge feature into main

commit log                       # view full history
match "^user:\\d+$"              # regex search across all keys
repo gc                          # reclaim unreferenced objects

Tips

  • Timestamps must be ISO-8601 UTC: YYYY-MM-DDTHH:MM:SSZ
  • Strings can be quoted with "..." and support escapes (\n, \t, \", \\)
  • scan <prefix> is fast (seeks into the B-tree). match <regex> traverses all nodes, prefer scan when a prefix query is enough.

Library Usage

Link against libgitdb and include the single header. Most of the time you only need the Repository class.

#include "gitdb/gitdb.hpp"
using namespace gitdb;

int main() {
    // open or create a repository
    Repository repo("./.gitdb");

    // write data and commit
    repo.put("user:1", "alice");
    repo.put("user:2", "bob");
    auto h = repo.commit("add users").value();

    // read
    auto val = repo.get("user:1");
    // val → optional<Bytes> containing "alice"

    // branch and modify
    repo.createBranch("feature");
    repo.checkout("feature");
    repo.put("user:3", "charlie");
    repo.commit("add charlie");

    // merge back
    repo.checkout("main");
    repo.setConflictResolver([](const Conflict& c) {
        return std::optional<Bytes>(c.ours);
    });
    repo.merge("feature", "merge feature");

    // time travel, read data at a previous commit
    auto snap = repo.at(h).value();
    auto old_val = snap.get("user:1");  // "alice" at that point in time

    // diff between any two refs
    auto changes = repo.diff("main~2", "main");
}

API Overview

KV Operations

Method Description
get(key) Retrieve value from current branch head
get(key, timestamp) Retrieve value at a specific point in time
put(key, value) Insert or update (staged until commit)
del(key) Delete a key (staged until commit)
scan(begin, end) Range scan over sorted keys
contains(key) Check if a key exists

Version Control

Method Description
commit(message) Snapshot the working tree
log(max_count) Commit history (newest first)
at(commit_hash) Read-only snapshot at a commit
diff(from, to) Differences between two commits or refs

Branching & Merging

Method Description
createBranch(name) Create a branch at HEAD
checkout(name) Switch branches
merge(branch, message) Three-way merge into current branch
setConflictResolver(fn) Register a conflict handler
tag(name) Tag the current HEAD commit

How It Works (The fun stuff)

Persistent copy-on-write B-trees

Every put creates new nodes along the path from leaf to root, producing a new root hash. Old nodes are never modified.

Graph Illustration

A branch is just a pointer to a root hash. A commit is a root hash + parent pointer + metadata. Unchanged subtrees are shared automatically through content-addressing.

Content-addressable object store

Every node, blob, and commit is stored as SHA-256(bytes) → bytes. This gives you deduplication (shared subtrees stored once), integrity checking (rehash to verify), and immutability (changing a byte changes the hash).

Atomicity

GitDB ensures that a crash at any point can't corrupt your data. This works through two mechanisms:

Atomic file writes. Every write to disk goes through a write-then-rename pattern: data is first written to a temporary file (with a random suffix), then atomically renamed to the final path. On POSIX systems, rename() is guaranteed to be atomic, the file either fully appears at the new path or doesn't. This means a crash mid-write can never produce a partially-written object or ref.

Write ordering. During a commit(), objects (new B-tree nodes) are written to the store first, and the branch ref is updated last. If the process crashes after writing objects but before updating the ref, the old commit is still valid, the branch still points to its previous commit, and the orphaned objects are just harmless wasted space. The ref update is the single atomic "switch" that makes the new commit visible.

Concurrency

GitDB uses a single-writer / multi-reader locking model with std::shared_mutex at two levels, plus an I/O staging layer optimized for commits:

Repository level. Reads (get, contains, scan) acquire a shared lock, any number of readers can run concurrently. Writes (put, del, commit) acquire an exclusive lock, blocking all other access for the duration of the operation.

Storage level. The disk backend has its own std::shared_mutex to make file I/O thread-safe. Object reads take a shared lock; object writes and ref updates take an exclusive lock. Mutations first go to an in‑memory staging area; on commit() only the objects reachable from the new B-tree root are persisted (“reachable‑only flush”).

Because nodes are immutable, a reader that grabbed a root hash before a write started can safely keep reading from the old tree, the writer doesn't invalidate anything, it only creates new nodes alongside the existing ones.

Maintenance: Garbage Collection (GC)

  • What it does: walks all live roots (branch heads and tags), traverses commits and B‑tree nodes/blobs to compute the reachable set, and deletes unreferenced objects from the object store.

  • When to run: after heavy write/benchmark sessions to reclaim intermediate objects not referenced by any commit.

  • Roots: branch heads and tags. If you want to keep a specific commit that’s not on any branch, create a tag for it before running GC.

  • REPL command:

    repo gc
    

    Prints the number of objects and bytes reclaimed.

Staging & Reachable‑Only Flush

  • Puts and deletes are staged in memory; reads see staged data immediately.
  • On commit() GitDB traverses the new root and persists only the reachable nodes and value blobs, then atomically updates the branch ref. This reduces on‑disk churn and keeps sizes close to the snapshot’s true footprint.

On-disk layout

.gitdb/
├── objects/          # content-addressed nodes and blobs
│   ├── a4/
│   │   └── a4f3c7b...
│   └── b2/
│       └── b21e84a...
├── refs/
│   ├── heads/        # branches → commit hashes
│   │   └── main
│   └── tags/
│       └── v1.0
└── HEAD              # current branch

Complexity

Operation Time
get(key) O(log n)
get(key, timestamp) O(log c + log n)
put(key, value) O(B · log n)
delete(key) O(B · log n)
scan(begin, end) O(log n + k)
match(regex) O(n) + regex cost
gc() O(L + A)
diff(c1, c2) O(m)

n = total keys, c = commits, B = tree order, k = results in range, m = changed nodes. L = reachable (live) objects, A = all stored objects.

Tests

./build/gitdb_tests

Project Structure

gitdb/
├── include/gitdb/
│   ├── gitdb.hpp              # single-include header
│   ├── types.hpp              # Hash, Bytes, Error, Timestamp, Conflict
│   ├── node.hpp               # B-tree node
│   ├── btree.hpp              # persistent CoW B-tree engine
│   ├── iterator.hpp           # range scan iterator
│   ├── commit.hpp             # immutable commit object
│   ├── branch.hpp             # branch manager
│   ├── diff.hpp               # structural diff engine
│   ├── merge.hpp              # three-way merge
│   ├── repository.hpp         # public API facade
│   ├── garbage_collector.hpp  # GC API (reachability sweep)
│   ├── detail/sha256.hpp      # bundled SHA-256
│   ├── storage/
│   │   ├── backend.hpp        # abstract storage interface
│   │   ├── disk.hpp           # filesystem backend
│   │   └── memory.hpp         # in-memory backend (testing)
│   └── repl/                  # REPL command handlers
├── src/                       # implementations
├── benchmarks/
│   └── bench_compare.py       # Python benchmarks (plots and comparisons)
├── tests/                     # GoogleTest suite
├── examples/
│   └── basic_usage.cpp
└── CMakeLists.txt

REPL Command Reference

Area Command Usage Description
KV put put <key> <value> Insert or update a key in the working tree
KV get get <key> [iso8601] Get a value at HEAD, or at a specific timestamp
KV del del <key> Delete a key from the working tree
KV scan scan <begin> [end] Prefix or range scan (inclusive end)
KV match match <regex> Regex match on keys (traverses all nodes)
KV root root Print the current HEAD commit hash
Commit create commit create <msg> Create a commit and advance HEAD
Commit show commit show <hex> Display commit metadata
Commit checkout commit checkout <hex> Set working tree to a commit (detached)
Commit log commit log List commit history on current branch
Commit head commit head Show HEAD commit summary
Branch create branch create <n> <hex> Create a branch at the given commit
Branch checkout branch checkout <n> Switch to a branch
Branch list branch list List all branches (* marks current)
Branch head branch head Print current branch and its head commit
Branch cc branch cc <n> <hex> Create and checkout in one step
Repo init repo init <path> Create a new repository
Repo open repo open <path> Open an existing repository
Repo delete repo delete <path> Delete a repository directory
Repo gc repo gc Sweep unreferenced objects (garbage collector)
Diff diff diff <from-hex> <to-hex> Diff between two commits
Diff refs diff refs <from> <to> Diff using branch/tag names (~N supported)
Merge merge merge <branch> <msg> Merge branch into HEAD
Merge ours merge ours <branch> <msg> Merge, preferring ours on conflicts
Merge theirs merge theirs <branch> <msg> Merge, preferring theirs on conflicts

Benchmarks

The Python harness in benchmarks/ measures put/get throughput, scan/match latencies, branch costs, and size growth. It can compare GitDB with PostgreSQL, SQLite, and Redis.

  • Dependencies:

    • Required: Python 3.8+, sqlite3 (stdlib)
    • Optional: python-dotenv (auto-loads benchmarks/.env), psycopg2-binary (PostgreSQL), redis (redis-py), matplotlib (plots)
  • Configure services in benchmarks/.env (examples):

    POSTGRES_DSN=host=127.0.0.1 dbname=bench user=postgres password=secret
    REDIS_URL=redis://127.0.0.1:6379/15
    GITDB_BENCH_N=10000
    GITDB_BENCH_VAL_SIZE=32
    BENCH_PERSIST=0
    
  • Run:

    py benchmarks/bench_compare.py
    
  • Output:

    • Prints per‑DB times for put/get/scan/match and branch (single).
    • Runs branch ×200 per DB to show branching scalability and prints “size (after x200) …”.
    • Saves plots under benchmarks/tmp/:
      • bench_branch_size.png — branch latency (1 vs ×200) and size before/after ×200
      • bench_throughput_latency.png — put/get throughput and scan/match latencies

Notes:

  • The harness clears state each run:
    • GitDB: fresh repo under benchmarks/tmp unless BENCH_PERSIST=1
    • PostgreSQL: drops all kv% tables
    • SQLite: removes the db file
    • Redis: deletes keys in both the base and branch prefixes

License

MIT, see LICENSE

About

Key-value database built in C++23 with complete version tracking (zero-cost branching) and query by timestamp while efficiently handling memory.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors