persist blocks and FullCommitQCs in data layer via WAL (CON-231) #3126

wen-coding merged 2 commits into main
Conversation
Codecov Report: additional details and impacted files.

```
@@           Coverage Diff            @@
##             main    #3126    +/-  ##
==========================================
+ Coverage   58.72%   58.78%   +0.05%
==========================================
  Files        2055     2058       +3
  Lines      168494   168876     +382
==========================================
+ Hits        98955    99275     +320
- Misses      60745    60775      +30
- Partials     8794     8826      +32
```
Force-push: 73df12b to 2613700
Resolved review threads (outdated):
- sei-tendermint/internal/autobahn/consensus/persist/globalblocks.go (2 threads)
- sei-tendermint/internal/autobahn/consensus/persist/globalcommitqcs.go (3 threads)
- sei-tendermint/internal/autobahn/consensus/persist/fullcommitqcs.go (1 thread)
Force-push: e24ad7d to 9c79a0c
Resolved review threads (outdated):
- sei-tendermint/internal/autobahn/consensus/persist/fullcommitqcs.go (3 threads)
- sei-tendermint/internal/autobahn/consensus/persist/fullcommitqcs_test.go (1 thread)
- sei-tendermint/internal/autobahn/consensus/persist/globalblocks.go (1 thread)
Force-push: bbfaa38 to 7b9f6b2
```go
pruningTime := time.Now()
for inner, ctrl := range s.inner.Lock() {
	for inner.first < min(n, inner.nextAppProposal) {
		target := inner.findPruneBoundary(s.cfg.Committee, func(qcEnd types.GlobalBlockNumber) bool {
```
nit: maybe it is just me, but I find this hard to read. `pruneBefore := i.qsc[min(n, nextAppProposal)-1].First` with appropriate overflow checks should do, right?
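The reviewer's suggestion might be sketched roughly as follows; `qcRange`, `pruneBefore`, and the linear scan are hypothetical stand-ins for illustration, not the actual sei-tendermint types:

```go
package main

import "fmt"

// GlobalBlockNumber and qcRange are invented stand-ins for the real types.
type GlobalBlockNumber uint64

type qcRange struct {
	First GlobalBlockNumber // inclusive start of the QC's global range
	Next  GlobalBlockNumber // exclusive end
}

// pruneBefore finds the First of the QC range covering block
// min(n, nextAppProposal)-1, with explicit underflow checks instead of a
// search helper; everything before that First can be pruned.
func pruneBefore(qcs []qcRange, n, nextAppProposal GlobalBlockNumber) (GlobalBlockNumber, bool) {
	target := min(n, nextAppProposal)
	if target == 0 || len(qcs) == 0 {
		return 0, false // nothing to prune; avoid target-1 underflow
	}
	last := target - 1
	// Linear scan for the QC range containing `last` (kept simple for clarity).
	for _, qc := range qcs {
		if qc.First <= last && last < qc.Next {
			return qc.First, true
		}
	}
	return 0, false
}

func main() {
	qcs := []qcRange{{First: 0, Next: 4}, {First: 4, Next: 9}}
	b, ok := pruneBefore(qcs, 7, 10) // block 6 lies in the QC starting at 4
	fmt.Println(b, ok)
}
```

The point of the explicit `target == 0` check is that the index arithmetic in the one-liner is only safe once underflow is ruled out.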
Actually, the problem for me is that the name "boundary" is vague, and so is "end" (is it inclusive or exclusive?).
I think it would be easier to move the pruning-boundary adjustment to DataWAL; it can round the pruning down to the boundary.
Or alternatively, we could prune as before, but fix loading so we load QCs only from the first block forward (i.e. we do not load QCs for block numbers before the first block, even if the first block in storage was not the first block of the QC's global range). This way we do not create a gap at the beginning.
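The "load from the first block forward" alternative might look like this minimal sketch; `loadedQC` and `loadFrom` are invented names, not the real loader in the persist package:

```go
package main

import "fmt"

// Invented stand-ins for the loaded-QC shape.
type GlobalBlockNumber uint64

type loadedQC struct {
	First GlobalBlockNumber // inclusive
	Next  GlobalBlockNumber // exclusive
}

// loadFrom keeps pruning as-is, but on load drops QCs that end at or before
// the first stored block, so no gap appears at the beginning even if pruning
// split a QC's global range.
func loadFrom(qcs []loadedQC, firstBlock GlobalBlockNumber) []loadedQC {
	kept := qcs[:0:0] // fresh slice; do not mutate the input backing array
	for _, qc := range qcs {
		if qc.Next <= firstBlock {
			continue // QC lies entirely before the first stored block
		}
		kept = append(kept, qc)
	}
	return kept
}

func main() {
	qcs := []loadedQC{{0, 3}, {3, 6}, {6, 9}}
	// {0,3} is dropped; {3,6} survives because it still covers blocks >= 4.
	fmt.Println(loadFrom(qcs, 4))
}
```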
Simplified a bit, does this look better?
it does, definitely more readable, thanks. I still have some simplifications in mind, but I'll just experiment with this later myself.
Sure, I thought about your proposals as well:
- moving the boundary adjustment to DataWAL: this would be similar logic to what we have now, just in a different location. I'm a bit hesitant to move it to DataWAL because, if we are trying to hide persistence internals, it means the block WAL needs to be aware of the qc WAL. Not sure how big a concern that is given we are moving to a new storage solution, though. It is also a bit odd that we refuse to serve block X but suddenly can after a restart, but that's a cosmetic complaint.
- prune normally but fix the loading logic: I just find it a bit weird that at the tail, QCs can arrive before blocks, while at the head, we use blocks to define where the usable data starts. But I can live with this asymmetry.
Changed to prune normally but fix the loading logic; does this look simpler?
> I'm a bit hesitant moving it to DataWAL because if we try to hide persistence internals, it does mean block WAL need to be aware of qc WAL

DataWAL wraps both the block WAL and the qc WAL; I imagine all the inconsistencies can be resolved within this common wrapper rather than in either of the internal WALs, no?
the pruning simplification looks neat, at least to me, thanks!
Force-push: af94d9a to 8e235c6
```go
// If WAL data starts past committee.FirstBlock() (due to pruning in a
// previous run), fast-forward all cursors to where data actually starts.
qcFirst := dataWAL.CommitQCs.LoadedFirst()
if qcFirst > cfg.Committee.FirstBlock() {
```
there is no need for skipTo to be conditional, right?
How do we plan to use the first block argument?
Could it be that we pruned to block 100, but somehow decided to restart everyone at block 105?
First block is constant per epoch. I have not planned to use it for coordinated hard forks yet.
oh, I thought that skipTo is a no-op if qcFirst <= cfg.Committee.FirstBlock(), but it is not.
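As this thread implies, a forward-only `skipTo` would make the caller's guard redundant. A minimal sketch, where `cursor` is a hypothetical stand-in for the real state cursors:

```go
package main

import "fmt"

// cursor is an invented stand-in for the state's per-stream cursors.
type cursor struct {
	next uint64 // next expected block number
}

// skipTo clamps to forward motion only. With this variant, calling
// skipTo(qcFirst) unconditionally is safe: skipTo(n) with n <= next is a
// no-op, so the `if qcFirst > cfg.Committee.FirstBlock()` check disappears.
func (c *cursor) skipTo(n uint64) {
	if n > c.next {
		c.next = n
	}
}

func main() {
	c := &cursor{next: 10}
	c.skipTo(5) // no-op: target is behind the cursor
	c.skipTo(17)
	fmt.Println(c.next)
}
```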
```go
	return fmt.Errorf("qc.Verify(): %w", err)
}
gr := qc.QC().GlobalRange(committee)
if gr.First != i.nextQC {
```
DataWAL still doesn't normalize loaded data, so this will fail if loaded blocks do not match loaded QCs, right?
Can you clarify a bit what you mean by "do not match"?

- The DataWAL construction naively loads everything on disk; normalization happens inside reconcile()
- The QC is the ultimate source of truth, so we verify and load QCs first (which is what we are doing here)
- Then, once QCs are in place, we verify and load blocks; blocks outside the QC range should be skipped
- Having QCs without matching blocks does happen in production, so it's expected

Hmm, do you mean we didn't check for block contiguity? The current persister guarantees block contiguity on write, but I guess we can add defense in depth here; changed.
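The defense-in-depth contiguity check discussed here could be sketched as below; `block` and `checkContiguous` are hypothetical names, simplified from the real loading loop:

```go
package main

import "fmt"

// block is an invented stand-in for a WAL-loaded block entry.
type block struct {
	Number uint64
}

// checkContiguous verifies on load that block numbers form an unbroken run.
// The persister guarantees contiguity on write, so this should never fire on
// healthy data; it exists to reject a corrupt WAL instead of silently
// leaving a gap in recovered state.
func checkContiguous(blocks []block) error {
	for i := 1; i < len(blocks); i++ {
		if blocks[i].Number != blocks[i-1].Number+1 {
			return fmt.Errorf("block gap in WAL: %d followed by %d",
				blocks[i-1].Number, blocks[i].Number)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkContiguous([]block{{4}, {5}, {6}})) // contiguous: nil
	fmt.Println(checkContiguous([]block{{4}, {6}}))      // gap: error
}
```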
What I mean is that loading data happens AFTER prefix reconciliation, so it is possible that gr.First < i.nextQC here, in case more blocks than QCs were pruned before a crash.
ok, I see now that NewState is dropping the non-reconciled part of the loaded state (I might have missed that in the previous review, sorry).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-push: 8c6d985 to e64ad5f
```go
if gr.Next <= inner.first {
	continue // fully before first, skip
}
if gr.First < inner.first {
```
this case can be merged into insertQC afaict
That's a good point, done.
insertQC now accepts QCs whose range starts before nextQC (a partially pruned prefix is silently skipped). This removes duplicated QC insertion logic from NewState's recovery loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
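A minimal sketch of the behavior this commit describes, assuming a simplified `state` with a single `nextQC` cursor (the real `insertQC` also verifies signatures and handles blocks):

```go
package main

import "fmt"

// state is a hypothetical, stripped-down stand-in for the consensus state.
type state struct {
	nextQC uint64 // exclusive end of the contiguous QC-covered range
}

type qcRange struct {
	First, Next uint64 // [First, Next), the QC's global block range
}

// insertQC accepts a QC whose range starts before nextQC: the already-covered
// (partially pruned) prefix is silently skipped. A QC entirely behind nextQC
// is a no-op; a QC starting past nextQC is still a gap error.
func (s *state) insertQC(gr qcRange) error {
	if gr.Next <= s.nextQC {
		return nil // entirely covered already, nothing to do
	}
	if gr.First > s.nextQC {
		return fmt.Errorf("QC gap: have up to %d, QC starts at %d", s.nextQC, gr.First)
	}
	s.nextQC = gr.Next // effectively covers [max(gr.First, old nextQC), gr.Next)
	return nil
}

func main() {
	s := &state{nextQC: 5}
	fmt.Println(s.insertQC(qcRange{First: 3, Next: 8}), s.nextQC) // prefix 3..5 skipped
	fmt.Println(s.insertQC(qcRange{First: 10, Next: 12}))         // gap: error
}
```

With this acceptance rule, NewState's recovery loop can feed every loaded QC straight into insertQC instead of special-casing the pruned prefix itself.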
* main:
  - Add receipt / log reads to cryptosim (#3081)
  - persist blocks and FullCommitQCs in data layer via WAL (CON-231) (#3126)
  - Update Changelog in prep to cut v6.4.1 (#3213)
  - fix(sei-tendermint): resolve staticcheck warnings (#3207)
  - Add historical state offload stream hook (#3183)
  - feat: wire autobahn config propagation from top-level to GigaRouter (CON-232) (#3194)
Summary
- `GlobalBlockPersister` and `FullCommitQCPersister` backed by `indexedWAL` for data-layer crash recovery
- `DataWAL` struct; `NewState` takes `DataWAL` for crash recovery
- `NewState` verifies all loaded data via `insertQC`/`insertBlock` (signatures + hashes), treating WAL data as untrusted. Uses blocks as the golden source for `inner.first` via `skipTo(max(blocksFirst, qcFirst))`
- `DataWAL.reconcile()` fixes WAL cursor inconsistencies at construction: prefix alignment (crash between parallel truncations) and tail trimming via `TruncateAfter` (blocks persisted without QCs). Loaded block data is trimmed in place
- QCs starting before `inner.first` are accepted for block verification without moving `first` backward
- `TruncateWhile` and `TruncateAfter` added to `indexedWAL`; `TruncateAfter` uses exclusive semantics (removes entries >= n)
- `TruncateBefore` handles `walIdx == nextIdx` (truncate all, skip verify); `walIdx > nextIdx` errors
- `PushQC`/`PushBlock` are pure in-memory; errors only from verification. Background `runPersist` writes QCs (eagerly up to `nextQC`) and blocks (up to `nextBlock`) in parallel via `scope.Parallel`. Persistence errors propagate vertically via `Run()`
- `nextBlockToPersist` cursor advances to `min(persistedQC, persistedBlock)`. `PushAppHash` (now takes `ctx`) waits on this cursor, ensuring AppVotes are only issued for persisted data
- `DataWAL.TruncateBefore` via `scope.Parallel`
- `PruneBefore(retainFrom)` prunes per block with a `+1` keep-last guard. May split QC ranges; handled on recovery. No QC-boundary awareness needed in pruning code
- `insertBlock` shared by `PushQC`, `PushBlock`, and `NewState`; does not advance `nextBlock` (callers batch, then call `updateNextBlock`)
- Block contiguity check in the `NewState` loading loop (defense in depth)
- `BlockStore` interface; renamed to `FullCommitQCPersister` / `fullcommitqcs` dir
- `giga_router.go` updated for the new `NewState`/`PushAppHash` signatures

Test plan
- `globalblocks_test.go`: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, continue after reload, truncate after (middle with loaded trim, no-op, before first). All use a randomized `FirstBlock`
- `fullcommitqcs_test.go`: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, mid-range truncation, continue after reload
- `wal_test.go`: `TruncateWhile` (empty, none match, partial, all, reopen); `TruncateAfter` (middle, last, before first, reopen); `TruncateBefore` past end errors
- `state_test.go`:
  - `TestStateRecoveryFromWAL`: full recovery; third restart verifies WALs not wiped
  - `TestStateRecoveryBlocksOnly`: QCs WAL lost, blocks re-pushed with QC
  - `TestStateRecoveryQCsOnly`: blocks WAL lost, cursor sync via reconcile
  - `TestStateRecoveryAfterPruning`: both WALs truncated, only tail survives
  - `TestStateRecoverySkipsStaleBlocks`: blocks before first QC range ignored
  - `TestStateRecoveryBlocksBehindQCs`: QCs ahead of blocks, gap re-fetched
  - `TestStateRecoveryIgnoresBlocksBeyondQC`: blocks beyond QC range ignored
  - `TestReconcileTruncatesBlocksTail`: stale blocks past QCs trimmed on startup
  - `TestRecoveryWithPartialQCPrefix`: partial QC prefix from per-block pruning; blocks as golden source
  - `TestPruningKeepsLastQCRange`: pruning never empties state; restart recovers
  - `TestPruningWithPartialQCRange`: per-block pruning splits QC range; recovery handles it
  - `TestRunPruningEmptyState`: no panic on first startup with no data
  - `TestStateRejectsBlockGapInWAL`: corrupt WAL with block gap detected
  - `TestExecution`: async persistence + `PushAppHash` wait semantics
- `data`, `avail`, `consensus`, `p2p/giga`

🤖 Generated with Claude Code