
feat: non-blocking concurrent checkpoint (WAL rotation + MVCC snapshot)#371

Merged
adsharma merged 11 commits into main from feat/concurrent-writes2
Apr 8, 2026

Conversation


adsharma (Contributor) commented Apr 8, 2026

Summary

This PR ports Vela-Engineering/kuzu's concurrent multi-writer checkpoint implementation
into ladybug. The implementation was developed by Aikins Laryea (Vela Partners) across
4 phases plus a read-only lock fix; the full diff is 2,360 lines across 45 files.

Before: checkpoint drains all transactions (reads + writes) and blocks new writes
until the entire checkpoint completes.

After: checkpoint only drains write transactions, releases the write gate as soon
as the WAL is rotated, and reads continue concurrently via MVCC snapshot isolation.
Shadow pages are applied with per-page CAS locking so optimistic readers detect updates.

What changed

Core mechanism

  • WAL rotation: at checkpoint start, the active .wal is atomically renamed to
    .wal.checkpoint (frozen WAL). New write transactions immediately create a fresh
    active WAL, isolated from the checkpoint.
  • Write gate release: once WAL is rotated the gate is released; storage
    materialization runs concurrently with new writers via per-node-group locking.
  • MVCC catalog snapshots: Catalog::serializeSnapshot(snapshotTS) and
    CatalogSet::serializeSnapshot(snapshotTxn) use a pinned snapshot timestamp for
    consistent serialization after the gate is released.
  • Epoch-based change tracking: Table::hasChanges (bool) replaced by
    changeEpoch (atomic uint64). captureChangeEpochs() snapshots per-table
    watermarks under the write gate; Table::checkpoint() uses watermarks to skip
    tables that haven't changed since the last gate.

Thread safety

  • StorageManager::mtx upgraded from exclusive std::mutex to
    mutable std::shared_mutex (reads take shared lock, DDL takes unique lock).
  • NodeTable::schemaMtx protects the columns vector during concurrent
    checkpoint vacuum.
  • PageManager::version and Catalog::version use std::atomic<uint64_t>
    with a stable lastCheckpointVersion baseline.
  • ShadowFile::applyShadowPages now calls BufferManager::updateFrameIfPageIsInFrame
    (the CAS-loop locked variant) instead of the lock-free variant, so optimistic
    readers observe a consistent page or retry.

WAL recovery

  • WALReplayer::replay now handles both frozen (.wal.checkpoint) and active
    (.wal) WAL files independently via replayFrozenWAL / replayActiveWAL.

Read-only databases

  • File lock is skipped when readOnly=true, enabling concurrent read-only opens.

Apple clang compatibility

  • Explicit constructors added to SegmentCheckpointState and LazySegmentData.

Testing

  • Updated test/transaction/checkpoint_test.cpp to use the new
    logCheckpointAndApplyShadowPages(bool walRotated) signature and the WAL
    rotation-aware flaky checkpointer classes; ShadowFileDatabaseIDMismatchExistingDB
    now validates the frozen WAL (.wal.checkpoint) rather than the active WAL.
  • CI workflow at embed-graph-db/.github/workflows/ladybug-concurrent-writes-ci.yml
    covers build (ubuntu + macos), full ctest suite, checkpoint/concurrent-write test
    filter, and benchmark comparison (blocks on >5% single-writer regression or
    zero read-during-checkpoint throughput).

Check out the latest benchmark run.

Attribution

Concurrent checkpoint implementation: Aikins Laryea, Vela Partners
(Vela-Engineering/kuzu, commits 3ad9071..d298e9c)

Ported and adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support) by Logan Powell.

loganpowell and others added 11 commits April 8, 2026 12:25
Ports Vela-Engineering/kuzu concurrent multi-writer support into ladybug.
Based on vela-kuzu commits 3ad9071..d298e9c relative to upstream
kuzudb/kuzu fork point 89f0263cc.

Changes
-------
WAL rotation
  checkpoint renames active WAL to .wal.checkpoint (frozen), allowing new
  write transactions to create a fresh WAL immediately

Non-blocking checkpoint
  only drains write transactions; reads continue via MVCC snapshot isolation
  (stopNewWriteTransactionsAndWaitUntilAllWriteTransactionsLeave replaces
  stopNewTransactionsAndWaitUntilAllTransactionsLeave)

MVCC catalog snapshots
  serializeSnapshot() uses a pinned snapshotTS for consistent catalog
  serialization after the write gate is released

Concurrent storage checkpoint
  epoch-based change tracking (changeEpoch replaces hasChanges bool) with
  per-table watermarks captured at drain time

Thread-safe shadow page application
  updateFrameIfPageIsInFrame with CAS loop for concurrent optimistic readers

Read-only opens
  skip file lock (LOCK_EX) when readOnly=true, allowing --readonly processes
  to open the DB concurrently with an active writer/checkpoint

WAL replayer
  handles both frozen (.wal.checkpoint) and active (.wal)

Atomic version tracking
  PageManager and Catalog use atomic<uint64_t> with lastCheckpointVersion

NodeTable schemaMtx
  shared_mutex protecting columns vector during concurrent checkpoint

StorageManager
  shared_mutex (was exclusive mutex) for getTable/serialize

Apple clang compat
  constructors for SegmentCheckpointState and LazySegmentData

Attribution: Vela-Engineering/kuzu (Aikins Laryea, Vela Partners)
Adapted for ladybug (lbug namespace, mainStorageManager field,
2-arg applyShadowPages API, std::format, graph catalog support)
…e checkpoint timeout tests

- Add Catalog::getVersionSinceCheckpoint() returning version - lastCheckpointVersion
  so CALL catalog_version() shows 0 immediately after CHECKPOINT (fixes CallCatalogVersion)
- Update CheckpointTimeoutErrorTest and AutoCheckpointTimeoutErrorTest to expect
  success (---- ok) instead of a timeout error: non-blocking checkpoint only waits
  on write transactions, so an open read-only transaction no longer blocks it
… const_cast, vacuumColumnIDs, DDL-race comment)
- Wrap pre-reload CHECKPOINT result in inner scope to prevent
  use-after-free when createDBAndConn() destroys the database
- Explicitly reset conn2/r2 before reloading DB in drain test
- Update CI workflow with correct test filter names
…ationale

The write gate (mtxForStartingNewTransactions) is correctly released after
WAL rotation. Holding it through the full checkpointStoragePhase would cause
a deadlock in tests that restart the database mid-checkpoint (e.g.
ShadowFileDatabaseIDMismatch*), since Database::~Database() calls
transactionManager->checkpoint() which would block indefinitely.

The hash-index/node-data snapshot divergence is a pre-existing limitation
of HashIndexLocalStorage having no per-entry timestamps. The correct fix
requires adding timestamp-aware snapshotting to HashIndexLocalStorage and
is tracked as a follow-up; document it in transaction_manager.cpp.

Also:
- Add rationale comment to postCheckpointCleanup explaining why no
  try/catch is intentional (data already durable at finishCheckpoint)
- Rename HashIndexConsistentAfterCheckpoint test to
  HashIndexBasicRecoveryAfterCheckpoint to reflect what it actually
  tests (baseline recovery, not the ghost-key invariant)
C++ (gist p1): wire enableMultiWrites through SystemConfig -> DBConfig
Python (gist p2): expose enable_multi_writes in py_database + database.py
New test: tools/python_api/test/test_mvcc_bank.py (Jepsen Bank MVCC suite)
Results: benchmarks/results/mvcc-bank-feat-concurrent-writes-20260402.md

All three runs passed with 0 anomalies (single-writer, 4w multi, 8w stress).
Fixes -Werror=reorder: enableMultiWrites was initialized before
autoCheckpoint in the ctor initializer list, but is declared after it
in database.h. Move enableMultiWrites to the end of the list to match
the declaration order (bufferPoolSize, maxNumThreads, enableCompression,
readOnly, maxDBSize, autoCheckpoint, ..., enableChecksums, enableMultiWrites).
adsharma merged commit 9653f52 into main on Apr 8, 2026
7 of 8 checks passed
adsharma deleted the feat/concurrent-writes2 branch on April 8, 2026 at 21:35