
persist blocks and FullCommitQCs in data layer via WAL (CON-231)#3126

Merged
wen-coding merged 2 commits into main from wen/persist_data
Apr 8, 2026

Conversation

@wen-coding
Contributor

@wen-coding wen-coding commented Mar 27, 2026

Summary

  • Add GlobalBlockPersister and FullCommitQCPersister backed by indexedWAL for data-layer crash recovery
  • Group into DataWAL struct; NewState takes DataWAL for crash recovery
  • NewState verifies all loaded data via insertQC/insertBlock (signatures + hashes), treating WAL data as untrusted. Uses blocks as golden source for inner.first via skipTo(max(blocksFirst, qcFirst))
  • DataWAL.reconcile() fixes WAL cursor inconsistencies at construction: prefix alignment (crash between parallel truncations) and tail trimming via TruncateAfter (blocks persisted without QCs). Loaded block data is trimmed in-place
  • Recovery handles partially pruned QC ranges: QCs whose range starts before inner.first are accepted for block verification without moving first backward
  • Add TruncateWhile and TruncateAfter to indexedWAL. TruncateAfter uses exclusive semantics (removes entries >= n)
  • TruncateBefore handles walIdx == nextIdx (truncate all, skip verify); walIdx > nextIdx errors
  • Async persistence: PushQC/PushBlock are pure in-memory — errors only from verification. Background runPersist writes QCs (eagerly up to nextQC) and blocks (up to nextBlock) in parallel via scope.Parallel. Persistence errors propagate vertically via Run()
  • nextBlockToPersist cursor advances to min(persistedQC, persistedBlock). PushAppHash (now takes ctx) waits on this cursor, ensuring AppVotes are only issued for persisted data
  • Parallel WAL truncation in DataWAL.TruncateBefore via scope.Parallel
  • Simple per-block pruning: PruneBefore(retainFrom) prunes per-block with +1 keep-last guard. May split QC ranges; handled on recovery. No QC-boundary awareness needed in pruning code
  • insertBlock shared by PushQC, PushBlock, and NewState; does not advance nextBlock (callers batch then call updateNextBlock)
  • Block contiguity check in NewState loading loop (defense in depth)
  • Removed unused BlockStore interface; renamed to FullCommitQCPersister / fullcommitqcs dir
  • Updated giga_router.go for new NewState/PushAppHash signatures
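The cursor handshake described in the async-persistence bullets can be sketched as below. This is a minimal illustration only: `persistCursor`, `advance`, and `waitPersisted` are assumed names, not the actual sei-tendermint API.

```go
package main

import (
	"fmt"
	"sync"
)

// persistCursor models the nextBlockToPersist cursor: it advances to
// min(persistedQC, persistedBlock), the highest block durable in BOTH WALs.
type persistCursor struct {
	mu             sync.Mutex
	cond           *sync.Cond
	persistedQC    uint64
	persistedBlock uint64
}

func newPersistCursor() *persistCursor {
	c := &persistCursor{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// cursor returns the highest block number persisted in both WALs.
func (c *persistCursor) cursor() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return min(c.persistedQC, c.persistedBlock)
}

// advance records progress of the two parallel background writers.
func (c *persistCursor) advance(qc, block uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.persistedQC = max(c.persistedQC, qc)
	c.persistedBlock = max(c.persistedBlock, block)
	c.cond.Broadcast()
}

// waitPersisted blocks until block n is durable, mirroring how
// PushAppHash waits before an AppVote may be issued.
func (c *persistCursor) waitPersisted(n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for min(c.persistedQC, c.persistedBlock) < n {
		c.cond.Wait()
	}
}

func main() {
	c := newPersistCursor()
	done := make(chan struct{})
	go func() {
		c.waitPersisted(5)
		close(done)
	}()
	c.advance(5, 3) // QCs ahead of blocks: cursor stays at 3
	c.advance(5, 5) // both sides durable through 5
	<-done
	fmt.Println("block 5 persisted in both WALs, cursor =", c.cursor())
}
```

The point of taking the min is that neither eager QC persistence nor eager block persistence alone unblocks the app-hash path; only the slower of the two writers does.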

Test plan

  • globalblocks_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, continue after reload, truncate after (middle with loaded trim, no-op, before first). All use randomized FirstBlock
  • fullcommitqcs_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, mid-range truncation, continue after reload
  • wal_test.go: TruncateWhile (empty, none match, partial, all, reopen); TruncateAfter (middle, last, before first, reopen); TruncateBefore past end errors
  • state_test.go:
    • TestStateRecoveryFromWAL — full recovery; third restart verifies WALs not wiped
    • TestStateRecoveryBlocksOnly — QCs WAL lost, blocks re-pushed with QC
    • TestStateRecoveryQCsOnly — blocks WAL lost, cursor sync via reconcile
    • TestStateRecoveryAfterPruning — both WALs truncated, only tail survives
    • TestStateRecoverySkipsStaleBlocks — blocks before first QC range ignored
    • TestStateRecoveryBlocksBehindQCs — QCs ahead of blocks, gap re-fetched
    • TestStateRecoveryIgnoresBlocksBeyondQC — blocks beyond QC range ignored
    • TestReconcileTruncatesBlocksTail — stale blocks past QCs trimmed on startup
    • TestRecoveryWithPartialQCPrefix — partial QC prefix from per-block pruning; blocks as golden
    • TestPruningKeepsLastQCRange — pruning never empties state; restart recovers
    • TestPruningWithPartialQCRange — per-block pruning splits QC range; recovery handles it
    • TestRunPruningEmptyState — no panic on first startup with no data
    • TestStateRejectsBlockGapInWAL — corrupt WAL with block gap detected
    • TestExecution — async persistence + PushAppHash wait semantics
  • All existing tests pass (data, avail, consensus, p2p/giga)
  • gofmt and go vet clean

🤖 Generated with Claude Code

@github-actions

github-actions bot commented Mar 27, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Apr 7, 2026, 6:25 PM |

@codecov

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 73.67206% with 114 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.78%. Comparing base (2b25de6) to head (383ecaf).
⚠️ Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-tendermint/internal/autobahn/data/state.go | 72.34% | 29 Missing and 23 partials ⚠️ |
| ...nternal/autobahn/consensus/persist/globalblocks.go | 77.77% | 18 Missing and 8 partials ⚠️ |
| ...ternal/autobahn/consensus/persist/fullcommitqcs.go | 78.75% | 12 Missing and 5 partials ⚠️ |
| ...dermint/internal/autobahn/consensus/persist/wal.go | 70.58% | 5 Missing and 5 partials ⚠️ |
| sei-tendermint/internal/p2p/giga_router.go | 20.00% | 4 Missing and 4 partials ⚠️ |
| sei-tendermint/internal/autobahn/data/testonly.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main    #3126      +/-   ##
==========================================
+ Coverage   58.72%   58.78%   +0.05%
==========================================
  Files        2055     2058       +3
  Lines      168494   168876     +382
==========================================
+ Hits        98955    99275     +320
- Misses      60745    60775      +30
- Partials     8794     8826      +32
```
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 76.29% <73.67%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...nt/internal/autobahn/consensus/persist/testonly.go | 100.00% <100.00%> (ø) |
| sei-tendermint/internal/autobahn/data/testonly.go | 61.32% <0.00%> (ø) |
| sei-tendermint/internal/p2p/giga_router.go | 65.95% <20.00%> (-2.94%) ⬇️ |
| ...dermint/internal/autobahn/consensus/persist/wal.go | 68.18% <70.58%> (+4.24%) ⬆️ |
| ...ternal/autobahn/consensus/persist/fullcommitqcs.go | 78.75% <78.75%> (ø) |
| ...nternal/autobahn/consensus/persist/globalblocks.go | 77.77% <77.77%> (ø) |

... and 4 files with indirect coverage changes


@wen-coding wen-coding changed the title persist FullCommitQCs in data layer via WAL persist blocks and FullCommitQCs in data layer via WAL Mar 27, 2026
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from 73df12b to 2613700 Compare March 31, 2026 00:16
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from e24ad7d to 9c79a0c Compare March 31, 2026 21:34
@wen-coding wen-coding requested a review from pompon0 March 31, 2026 21:36
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from bbfaa38 to 7b9f6b2 Compare April 2, 2026 20:41
```go
pruningTime := time.Now()
for inner, ctrl := range s.inner.Lock() {
	for inner.first < min(n, inner.nextAppProposal) {
		target := inner.findPruneBoundary(s.cfg.Committee, func(qcEnd types.GlobalBlockNumber) bool {
```
Contributor

nit: maybe it is just me, but I find it hard to read.
`pruneBefore := i.qsc[min(n, nextAppProposal)-1].First` with appropriate overflow checks should do, right?

Contributor

actually the problem for me is that "boundary" name is vague, and so is "end" (is it inclusive/exclusive)

Contributor

I think, it would be easier to move the pruning boundary adjustment to DataWAL - it can round down the pruning to the boundary.

Contributor

OR alternatively, we could prune as previously, but when loading we load qcs only from the first block forward (i.e. we do not load the qc for the block numbers before the first block, even if the first block in storage was not the first block of the qc global range). This way we do not create a gap at the beginning.

Contributor Author

Simplified a bit, does this look better?

Contributor

it does, definitely more readable, thanks. I still have some simplifications in mind, but I'll just experiment with this later myself.

Contributor Author

Sure, I thought about your proposals as well:

  • moving the boundary adjustment to DataWAL: this would be similar logic to what we have now, just in a different location. I'm a bit hesitant to move it to DataWAL because if we try to hide persistence internals, the block WAL would need to be aware of the qc WAL. Not sure how big a concern that is, given we are moving to a new storage solution. It is also a bit weird that we refuse to serve block X but suddenly after restart we can, but that's a cosmetic complaint.

  • prune normally but fix the loading logic: I just find it a bit weird that at the tail, QCs can arrive before blocks, while at the head, we use blocks to define where the usable data starts. But I can live with this asymmetry.

Changed to prune normally but fix the loading logic, does this look simpler?

Contributor

> I'm a bit hesitant to move it to DataWAL because if we try to hide persistence internals, the block WAL would need to be aware of the qc WAL

DataWAL wraps both the block WAL and the qc WAL; I imagine all the inconsistencies can be resolved within this common wrapper, not in either of the internal WALs, no?
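For illustration, the reconciliation such a wrapper performs could be reduced to span arithmetic over the two inner WALs. `walSpan` and `reconcile` below are assumed names for a sketch, not the actual implementation:

```go
package main

import "fmt"

// walSpan models just the [first, next) range an inner WAL currently holds.
type walSpan struct{ first, next uint64 }

// reconcile aligns the two spans: prefixes are advanced to the larger
// first (a crash between parallel truncations can leave them unequal),
// and the blocks tail is trimmed to the QC tail, since blocks may be
// persisted before their covering QC (the TruncateAfter case).
func reconcile(blocks, qcs walSpan) (walSpan, walSpan) {
	first := max(blocks.first, qcs.first)
	blocks.first, qcs.first = first, first
	if blocks.next > qcs.next {
		blocks.next = qcs.next // trim stale blocks past the last QC
	}
	return blocks, qcs
}

func main() {
	// Blocks pruned less than QCs at the head, persisted further at the tail.
	b, q := reconcile(walSpan{10, 25}, walSpan{12, 20})
	fmt.Println(b, q) // {12 20} {12 20}
}
```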

Contributor

the pruning simplification looks neat, at least to me, thanks!

@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from af94d9a to 8e235c6 Compare April 3, 2026 18:15
```go
// If WAL data starts past committee.FirstBlock() (due to pruning in a
// previous run), fast-forward all cursors to where data actually starts.
qcFirst := dataWAL.CommitQCs.LoadedFirst()
if qcFirst > cfg.Committee.FirstBlock() {
```
Contributor

@pompon0 pompon0 Apr 6, 2026

there is no need for skipTo to be conditional, right?

Contributor Author

How do we plan to use the first block argument?

Could it be that we purged to block 100, but somehow we decided to restart everyone at block 105?

Contributor

First block is constant per epoch. I have not planned for using it for coordinated hard forks yet.

Contributor

oh, I thought that skipTo is a no-op if `qcFirst <= cfg.Committee.FirstBlock()`, but it is not.
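One way the guard becomes unnecessary is if skipTo clamps internally; a minimal sketch, assuming a hypothetical `inner` type and `first` cursor field:

```go
package main

import "fmt"

// inner mimics just the recovery cursor; names are illustrative.
type inner struct{ first uint64 }

// skipTo fast-forwards the cursor and is a no-op when the target is not
// ahead, so callers can invoke it unconditionally without the
// `if qcFirst > cfg.Committee.FirstBlock()` check.
func (i *inner) skipTo(n uint64) {
	if n > i.first {
		i.first = n
	}
}

func main() {
	i := &inner{first: 100}
	i.skipTo(90) // behind the cursor: nothing happens
	i.skipTo(105)
	fmt.Println(i.first) // 105
}
```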

```go
	return fmt.Errorf("qc.Verify(): %w", err)
}
gr := qc.QC().GlobalRange(committee)
if gr.First != i.nextQC {
```
Contributor

DataWAL still doesn't normalize loaded data, so this will fail if loaded blocks do not match loaded QCs, right?

Contributor Author

Can you clarify a little bit what you mean by "do not match"?

  • The DataWAL construction dumbly loads everything on disk; normalization happens inside reconcile()
  • QC is the ultimate truth, so we verify and load QCs first (which is what we are doing here)
  • Then, once QCs are in place, we verify and load blocks; blocks outside the QC range should be skipped
  • Having QCs without matching blocks happens in production, so it's expected

Hmm, do you mean we don't check for block contiguity? The current persister guarantees block contiguity on writing, but I guess we can add a defense-in-depth check here; changed.
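The defense-in-depth contiguity check can be sketched as a simple scan over the loaded block numbers, treating WAL contents as untrusted. `checkContiguous` is an assumed name for illustration, not the real code:

```go
package main

import "fmt"

// checkContiguous rejects any gap in loaded block numbers. The persister
// guarantees contiguity on write, so this only fires on a corrupt WAL.
func checkContiguous(first uint64, nums []uint64) error {
	want := first
	for _, n := range nums {
		if n != want {
			return fmt.Errorf("block gap in WAL: got %d, want %d", n, want)
		}
		want++
	}
	return nil
}

func main() {
	fmt.Println(checkContiguous(7, []uint64{7, 8, 9})) // <nil>
	fmt.Println(checkContiguous(7, []uint64{7, 9}))    // block gap in WAL: got 9, want 8
}
```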

Contributor

What I mean is that loading data happens AFTER prefix reconciliation, so it is possible that gr.First < i.nextQC here, in case more blocks than QCs were pruned before a crash.

Contributor

ok, I see now that NewState is dropping the non-reconciled part of the loaded state (I might have missed that in the previous review, sorry).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```go
if gr.Next <= inner.first {
	continue // fully before first, skip
}
if gr.First < inner.first {
```
Contributor

this case can be merged into insertQC afaict

Contributor Author

That's a good point, done.

@pompon0 pompon0 self-requested a review April 7, 2026 14:16
insertQC now accepts QCs whose range starts before nextQC (partially
pruned prefix silently skipped). This removes duplicated QC insertion
logic from NewState's recovery loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
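The behavior this commit describes can be sketched as follows; `state`, `qcRange`, and `insertQC` are simplified stand-ins for the real types, shown only to illustrate the prefix-tolerance rule:

```go
package main

import "fmt"

type state struct{ nextQC uint64 }

// qcRange models a QC's global block range [first, next).
type qcRange struct{ first, next uint64 }

// insertQC accepts a range starting before nextQC: the already-pruned
// prefix is silently skipped instead of being treated as a gap error.
// A range starting past nextQC is still a gap and rejected.
func (s *state) insertQC(r qcRange) error {
	if r.next <= s.nextQC {
		return nil // entirely before the cursor: nothing to do
	}
	if r.first > s.nextQC {
		return fmt.Errorf("qc gap: range starts at %d, want %d", r.first, s.nextQC)
	}
	s.nextQC = r.next
	return nil
}

func main() {
	s := &state{nextQC: 10}
	fmt.Println(s.insertQC(qcRange{8, 15}), s.nextQC) // prefix 8..10 skipped
	fmt.Println(s.insertQC(qcRange{20, 25}))          // gap: rejected
}
```

This is what lets the recovery loop in NewState feed every loaded QC straight into insertQC without a special case for partially pruned ranges.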
@wen-coding wen-coding changed the title persist blocks and FullCommitQCs in data layer via WAL persist blocks and FullCommitQCs in data layer via WAL (CON-231) Apr 8, 2026
@wen-coding wen-coding added this pull request to the merge queue Apr 8, 2026
Merged via the queue into main with commit 289bc61 Apr 8, 2026
40 checks passed
@wen-coding wen-coding deleted the wen/persist_data branch April 8, 2026 21:03
yzang2019 added a commit that referenced this pull request Apr 9, 2026
* main:
  Add receipt / log reads to cryptosim (#3081)
  persist blocks and FullCommitQCs in data layer via WAL (CON-231) (#3126)
  Update Changelog in prep to cut v6.4.1 (#3213)
  fix(sei-tendermint): resolve staticcheck warnings (#3207)
  Add historical state offload stream hook (#3183)
  feat: wire autobahn config propagation from top-level to GigaRouter (CON-232) (#3194)