feat: non-blocking concurrent checkpoint (WAL rotation + MVCC snapshot) #371
Merged
Ports Vela-Engineering/kuzu concurrent multi-writer support into ladybug.
Based on vela-kuzu commits 3ad9071..d298e9c relative to upstream kuzudb/kuzu fork point 89f0263cc.

Changes
-------
- WAL rotation: checkpoint renames active WAL to .wal.checkpoint (frozen), allowing new write transactions to create a fresh WAL immediately
- Non-blocking checkpoint: only drains write transactions; reads continue via MVCC snapshot isolation (stopNewWriteTransactionsAndWaitUntilAllWriteTransactionsLeave replaces stopNewTransactionsAndWaitUntilAllTransactionsLeave)
- MVCC catalog snapshots: serializeSnapshot() uses a pinned snapshotTS for consistent catalog serialization after the write gate is released
- Concurrent storage checkpoint: epoch-based change tracking (changeEpoch replaces hasChanges bool) with per-table watermarks captured at drain time
- Thread-safe shadow page application: updateFrameIfPageIsInFrame with CAS loop for concurrent optimistic readers
- Read-only opens: skip file lock (LOCK_EX) when readOnly=true, allowing --readonly processes to open the DB concurrently with an active writer/checkpoint
- WAL replayer: handles both frozen (.wal.checkpoint) and active (.wal)
- Atomic version tracking: PageManager and Catalog use atomic<uint64_t> with lastCheckpointVersion
- NodeTable schemaMtx: shared_mutex protecting columns vector during concurrent checkpoint
- StorageManager shared_mutex (was exclusive mutex) for getTable/serialize
- Apple clang compat: constructors for SegmentCheckpointState and LazySegmentData

Attribution: Vela-Engineering/kuzu (Aikins Laryea, Vela Partners)
Adapted for ladybug (lbug namespace, mainStorageManager field, 2-arg applyShadowPages API, std::format, graph catalog support)
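To make the WAL rotation and replayer items concrete, here is a minimal sketch using only std::filesystem. The helper names rotateWAL and walReplayOrder are hypothetical and do not appear in the PR. Replaying the frozen file before the active one is an assumption based on the frozen WAL holding the older records.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: freeze the active WAL by renaming it next to itself.
// Returns true if a rotation happened, false if there was no active WAL.
// Note: a real implementation must decide what to do if a frozen WAL from an
// earlier, unfinished checkpoint is still present; this sketch ignores that.
inline bool rotateWAL(const fs::path& dbPath) {
    fs::path active = dbPath;
    active += ".wal";
    fs::path frozen = dbPath;
    frozen += ".wal.checkpoint";
    if (!fs::exists(active)) {
        return false;
    }
    fs::rename(active, frozen); // same directory, so a single rename call
    return true;
}

// Hypothetical helper: WAL files that exist, in assumed replay order
// (frozen first because it holds the older records, then active).
inline std::vector<fs::path> walReplayOrder(const fs::path& dbPath) {
    std::vector<fs::path> order;
    fs::path frozen = dbPath;
    frozen += ".wal.checkpoint";
    fs::path active = dbPath;
    active += ".wal";
    if (fs::exists(frozen)) {
        order.push_back(frozen);
    }
    if (fs::exists(active)) {
        order.push_back(active);
    }
    return order;
}
```

After rotateWAL returns, a writer that opens `<db>.wal` creates a new file, so its records never mix with the frozen ones being checkpointed.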
…es that differed from clang-22)
…e checkpoint timeout tests
- Add Catalog::getVersionSinceCheckpoint() returning version - lastCheckpointVersion so CALL catalog_version() shows 0 immediately after CHECKPOINT (fixes CallCatalogVersion)
- Update CheckpointTimeoutErrorTest and AutoCheckpointTimeoutErrorTest to expect success (---- ok) instead of a timeout error: non-blocking checkpoint only waits on write transactions, so an open read-only transaction no longer blocks it
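The version arithmetic in the first bullet can be sketched as follows. VersionTracker, bump and markCheckpoint are hypothetical names for illustration. Only getVersionSinceCheckpoint and the two counters are named in the commit message, and the real Catalog may update them differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the atomic version counters described above.
class VersionTracker {
public:
    // Called on every catalog change.
    void bump() { version.fetch_add(1, std::memory_order_relaxed); }

    // Called when a checkpoint finishes: remember the version it covered.
    void markCheckpoint() {
        lastCheckpointVersion.store(version.load(std::memory_order_acquire),
            std::memory_order_release);
    }

    uint64_t getVersion() const { return version.load(std::memory_order_acquire); }

    // version - lastCheckpointVersion, so the result is 0 right after a checkpoint.
    uint64_t getVersionSinceCheckpoint() const {
        return version.load(std::memory_order_acquire) -
               lastCheckpointVersion.load(std::memory_order_acquire);
    }

private:
    std::atomic<uint64_t> version{0};
    std::atomic<uint64_t> lastCheckpointVersion{0};
};
```

The absolute version keeps growing across checkpoints, while the reported delta resets, which is what the CallCatalogVersion expectation relies on.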
… const_cast, vacuumColumnIDs, DDL-race comment)
- Wrap pre-reload CHECKPOINT result in inner scope to prevent use-after-free when createDBAndConn() destroys the database
- Explicitly reset conn2/r2 before reloading DB in drain test
- Update CI workflow with correct test filter names
…ationale

The write gate (mtxForStartingNewTransactions) is correctly released after WAL rotation. Holding it through the full checkpointStoragePhase would cause a deadlock in tests that restart the database mid-checkpoint (e.g. ShadowFileDatabaseIDMismatch*), since Database::~Database() calls transactionManager->checkpoint() which would block indefinitely.

The hash-index/node-data snapshot divergence is a pre-existing limitation of HashIndexLocalStorage having no per-entry timestamps. The correct fix requires adding timestamp-aware snapshotting to HashIndexLocalStorage and is tracked as a follow-up; document it in transaction_manager.cpp.

Also:
- Add rationale comment to postCheckpointCleanup explaining why no try/catch is intentional (data already durable at finishCheckpoint)
- Rename HashIndexConsistentAfterCheckpoint test to HashIndexBasicRecoveryAfterCheckpoint to reflect what it actually tests (baseline recovery, not the ghost-key invariant)
C++ (gist p1): wire enableMultiWrites through SystemConfig -> DBConfig
Python (gist p2): expose enable_multi_writes in py_database + database.py
New test: tools/python_api/test/test_mvcc_bank.py (Jepsen Bank MVCC suite)
Results: benchmarks/results/mvcc-bank-feat-concurrent-writes-20260402.md
All three runs passed with 0 anomalies (single-writer, 4w multi, 8w stress).
Fixes -Werror=reorder: enableMultiWrites was initialized before autoCheckpoint in the ctor initializer list, but is declared after it in database.h. Move enableMultiWrites to the end of the list to match the declaration order (bufferPoolSize, maxNumThreads, enableCompression, readOnly, maxDBSize, autoCheckpoint, ..., enableChecksums, enableMultiWrites).
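For readers unfamiliar with the warning: C++ always initializes non-static members in declaration order, whatever order the initializer list is written in, and -Wreorder flags any mismatch. The struct below is a hypothetical, trimmed-down stand-in, not the real config from database.h. It shows the fixed shape, with the newest member last in both places.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down config; the real struct has more fields.
struct ConfigSketch {
    uint64_t bufferPoolSize;
    bool readOnly;
    bool autoCheckpoint;
    bool enableMultiWrites; // declared last, so it is initialized last

    ConfigSketch(uint64_t bufferPoolSize, bool readOnly, bool autoCheckpoint,
        bool enableMultiWrites)
        // The initializer list follows declaration order, so -Wreorder stays quiet.
        : bufferPoolSize{bufferPoolSize}, readOnly{readOnly},
          autoCheckpoint{autoCheckpoint}, enableMultiWrites{enableMultiWrites} {}
};
```

Listing enableMultiWrites before autoCheckpoint in the initializer list would not change the actual initialization order, but it triggers the warning, which -Werror turns into a build failure.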
Summary
This PR ports Vela-Engineering/kuzu's concurrent multi-writer checkpoint implementation
into ladybug. The implementation was developed by Aikins Laryea (Vela Partners) across
4 phases plus a read-only lock fix; the full diff is 2,360 lines across 45 files.
Before: checkpoint drains all transactions (reads + writes) and blocks new writes
until the entire checkpoint completes.
After: checkpoint only drains write transactions, releases the write gate as soon
as the WAL is rotated, and reads continue concurrently via MVCC snapshot isolation.
Shadow pages are applied with per-page CAS locking so optimistic readers detect updates.
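The "after" behaviour can be pictured with a small gate that only counts writers. This is a sketch under assumptions, not the TransactionManager code. WriteGate and its methods are hypothetical names, and the WAL rotation is passed in as a callback that must not call back into the gate.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical gate: checkpoints wait for writers only, never for readers.
class WriteGate {
public:
    // Readers are counted for illustration but never wait on the gate.
    void beginRead() { std::lock_guard<std::mutex> lk(mtx); ++readers; }
    void endRead() { std::lock_guard<std::mutex> lk(mtx); --readers; }

    // Writers block only while the gate is closed (drain + rotation).
    void beginWrite() {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return !gateClosed; });
        ++writers;
    }
    void endWrite() {
        { std::lock_guard<std::mutex> lk(mtx); --writers; }
        cv.notify_all();
    }

    // Close the gate, wait for in-flight writers, rotate, then reopen at once.
    // The slow storage phase would run after this returns, outside the gate.
    void drainWritersAndRotate(const std::function<void()>& rotate) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return !gateClosed; }); // one checkpoint at a time
        gateClosed = true;
        cv.wait(lk, [this] { return writers == 0; });
        rotate(); // runs with the gate mutex held
        gateClosed = false;
        lk.unlock();
        cv.notify_all();
    }

    int activeReaders() { std::lock_guard<std::mutex> lk(mtx); return readers; }
    int activeWriters() { std::lock_guard<std::mutex> lk(mtx); return writers; }

private:
    std::mutex mtx;
    std::condition_variable cv;
    bool gateClosed = false;
    int readers = 0;
    int writers = 0;
};
```

An open read transaction does not delay drainWritersAndRotate, which matches the updated timeout tests that now expect the checkpoint to succeed.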
What changed
Core mechanism
- .wal is atomically renamed to .wal.checkpoint (frozen WAL). New write transactions immediately create a fresh active WAL, isolated from the checkpoint.
- materialization runs concurrently with new writers via per-node-group locking.
- Catalog::serializeSnapshot(snapshotTS) and CatalogSet::serializeSnapshot(snapshotTxn) use a pinned snapshot timestamp for consistent serialization after the gate is released.
- Table::hasChanges (bool) replaced by changeEpoch (atomic uint64). captureChangeEpochs() snapshots per-table watermarks under the write gate; Table::checkpoint() uses watermarks to skip tables that haven't changed since the last gate.
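The epoch and watermark idea from the last item can be sketched like this. TableSketch, markChanged and checkpointedEpoch are hypothetical names, and the real Table::checkpoint() signature is not shown in this PR description. The presumed motivation is that a bool cleared at the end of a checkpoint could lose a write that arrived while the checkpoint was running, whereas a counter cannot.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical table with an epoch counter instead of a hasChanges flag.
class TableSketch {
public:
    // Writers call this on every modification.
    void markChanged() { changeEpoch.fetch_add(1, std::memory_order_release); }

    // Called under the write gate: pin the epoch this checkpoint must cover.
    uint64_t captureChangeEpoch() const {
        return changeEpoch.load(std::memory_order_acquire);
    }

    // Returns true if the table was checkpointed, false if it was skipped
    // because nothing changed since the previously covered watermark.
    bool checkpoint(uint64_t watermark) {
        if (watermark == checkpointedEpoch) {
            return false;
        }
        // ... materialize changes up to `watermark` here ...
        checkpointedEpoch = watermark;
        return true;
    }

private:
    std::atomic<uint64_t> changeEpoch{0};
    uint64_t checkpointedEpoch{0}; // touched only by the checkpointer
};
```

A write that lands after the watermark was captured bumps changeEpoch past checkpointedEpoch, so the next checkpoint still sees the table as changed.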
Thread safety
- StorageManager::mtx upgraded from exclusive std::mutex to mutable std::shared_mutex (reads take shared lock, DDL takes unique lock).
- NodeTable::schemaMtx protects the columns vector during concurrent checkpoint vacuum.
- PageManager::version and Catalog::version use std::atomic<uint64_t> with a stable lastCheckpointVersion baseline.
- ShadowFile::applyShadowPages now calls BufferManager::updateFrameIfPageIsInFrame (CAS-loop locked variant) instead of the lock-free variant.
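To illustrate why a CAS-loop update lets optimistic readers detect a concurrent page change, here is a simplified sketch. FrameSketch and its state layout (low bit as lock, remaining bits as version) are assumptions for illustration and are not the buffer manager's actual page state encoding.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical frame: state = (version << 1) | lock bit.
struct FrameSketch {
    static constexpr uint64_t LOCKED = 1;
    std::atomic<uint64_t> state{0};
    std::array<uint8_t, 8> data{};

    // Updater: spin with compare-exchange until the lock bit is acquired,
    // copy the new bytes, then publish a new version with the lock cleared.
    void update(const std::array<uint8_t, 8>& src) {
        uint64_t cur = state.load(std::memory_order_relaxed);
        while (true) {
            if (cur & LOCKED) {
                cur = state.load(std::memory_order_relaxed);
                continue;
            }
            if (state.compare_exchange_weak(cur, cur | LOCKED,
                    std::memory_order_acquire)) {
                break; // cur still holds the unlocked value we replaced
            }
        }
        data = src;
        state.store(cur + 2, std::memory_order_release); // version + 1, unlocked
    }

    // Optimistic reader: copy without locking, then validate the version.
    // Returns false if an update was in progress or happened during the copy.
    // (Production code must also make the racy copy itself well-defined.)
    bool tryOptimisticRead(std::array<uint8_t, 8>& out) const {
        uint64_t before = state.load(std::memory_order_acquire);
        if (before & LOCKED) {
            return false;
        }
        out = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        return state.load(std::memory_order_relaxed) == before;
    }
};
```

A lock-free overwrite that leaves the state word untouched would give the reader no way to notice that the bytes changed mid-copy, which is presumably why the locked variant is used here.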
WAL recovery
- WALReplayer::replay now handles both frozen (.wal.checkpoint) and active (.wal) WAL files independently via replayFrozenWAL / replayActiveWAL.
Read-only databases
- readOnly=true, enabling concurrent read-only opens.
Apple clang compatibility
- SegmentCheckpointState and LazySegmentData.
Testing
- test/transaction/checkpoint_test.cpp to use the new logCheckpointAndApplyShadowPages(bool walRotated) signature, WAL rotation-aware flaky checkpointer classes, and
- ShadowFileDatabaseIDMismatchExistingDB now validates the frozen WAL (.wal.checkpoint) rather than the active WAL.
- embed-graph-db/.github/workflows/ladybug-concurrent-writes-ci.yml covers build (ubuntu + macos), full ctest suite, checkpoint/concurrent-write test filter, and benchmark comparison (blocks on >5% single-writer regression or zero read-during-checkpoint throughput).
Checkout Latest Benchmark Run
Attribution
Concurrent checkpoint implementation: Aikins Laryea, Vela Partners
(Vela-Engineering/kuzu, commits 3ad9071..d298e9c)
Ported and adapted for ladybug (lbug namespace, mainStorageManager field, 2-arg
applyShadowPages API, std::format, graph catalog support) by Logan Powell.