
persist blocks and FullCommitQCs in data layer via WAL (CON-231)#3126

Merged
wen-coding merged 2 commits into main from wen/persist_data
Apr 8, 2026

Conversation

@wen-coding
Contributor

@wen-coding wen-coding commented Mar 27, 2026

Summary

  • Add GlobalBlockPersister and FullCommitQCPersister backed by indexedWAL for data-layer crash recovery
  • Group into DataWAL struct; NewState takes DataWAL for crash recovery
  • NewState verifies all loaded data via insertQC/insertBlock (signatures + hashes), treating WAL data as untrusted. Uses blocks as golden source for inner.first via skipTo(max(blocksFirst, qcFirst))
  • DataWAL.reconcile() fixes WAL cursor inconsistencies at construction: prefix alignment (crash between parallel truncations) and tail trimming via TruncateAfter (blocks persisted without QCs). Loaded block data is trimmed in-place
  • Recovery handles partially pruned QC ranges: QCs whose range starts before inner.first are accepted for block verification without moving first backward
  • Add TruncateWhile and TruncateAfter to indexedWAL. TruncateAfter uses exclusive semantics (removes entries >= n)
  • TruncateBefore handles walIdx == nextIdx (truncate all, skip verify); walIdx > nextIdx errors
  • Async persistence: PushQC/PushBlock are pure in-memory — errors only from verification. Background runPersist writes QCs (eagerly up to nextQC) and blocks (up to nextBlock) in parallel via scope.Parallel. Persistence errors propagate vertically via Run()
  • nextBlockToPersist cursor advances to min(persistedQC, persistedBlock). PushAppHash (now takes ctx) waits on this cursor, ensuring AppVotes are only issued for persisted data
  • Parallel WAL truncation in DataWAL.TruncateBefore via scope.Parallel
  • Simple per-block pruning: PruneBefore(retainFrom) prunes per-block with +1 keep-last guard. May split QC ranges; handled on recovery. No QC-boundary awareness needed in pruning code
  • insertBlock shared by PushQC, PushBlock, and NewState; does not advance nextBlock (callers batch then call updateNextBlock)
  • Block contiguity check in NewState loading loop (defense in depth)
  • Removed unused BlockStore interface; renamed to FullCommitQCPersister / fullcommitqcs dir
  • Updated giga_router.go for new NewState/PushAppHash signatures
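The cursor handshake described in the async-persistence bullets can be sketched as below. This is a minimal illustration only: `persistCursor`, `advance`, and `waitPersisted` are assumed names, not the actual sei-tendermint API.

```go
package main

import (
	"fmt"
	"sync"
)

// persistCursor models the nextBlockToPersist cursor: it advances to
// min(persistedQC, persistedBlock), the highest block durable in BOTH WALs.
type persistCursor struct {
	mu             sync.Mutex
	cond           *sync.Cond
	persistedQC    uint64
	persistedBlock uint64
}

func newPersistCursor() *persistCursor {
	c := &persistCursor{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// cursor returns the highest block number persisted in both WALs.
func (c *persistCursor) cursor() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return min(c.persistedQC, c.persistedBlock)
}

// advance records progress of the two parallel background writers.
func (c *persistCursor) advance(qc, block uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.persistedQC = max(c.persistedQC, qc)
	c.persistedBlock = max(c.persistedBlock, block)
	c.cond.Broadcast()
}

// waitPersisted blocks until block n is durable, mirroring how
// PushAppHash waits before an AppVote may be issued.
func (c *persistCursor) waitPersisted(n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for min(c.persistedQC, c.persistedBlock) < n {
		c.cond.Wait()
	}
}

func main() {
	c := newPersistCursor()
	done := make(chan struct{})
	go func() {
		c.waitPersisted(5)
		close(done)
	}()
	c.advance(5, 3) // QCs ahead of blocks: cursor stays at 3
	c.advance(5, 5) // both sides durable through 5
	<-done
	fmt.Println("block 5 persisted in both WALs, cursor =", c.cursor())
}
```

The point of taking the min is that neither eager QC persistence nor eager block persistence alone unblocks the app-hash path; only the slower of the two writers does.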

Test plan

  • globalblocks_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, continue after reload, truncate after (middle with loaded trim, no-op, before first). All use randomized FirstBlock
  • fullcommitqcs_test.go: persist & reload, truncate & reload, truncate all, no-op, duplicate ignored, gap error, mid-range truncation, continue after reload
  • wal_test.go: TruncateWhile (empty, none match, partial, all, reopen); TruncateAfter (middle, last, before first, reopen); TruncateBefore past end errors
  • state_test.go:
    • TestStateRecoveryFromWAL — full recovery; third restart verifies WALs not wiped
    • TestStateRecoveryBlocksOnly — QCs WAL lost, blocks re-pushed with QC
    • TestStateRecoveryQCsOnly — blocks WAL lost, cursor sync via reconcile
    • TestStateRecoveryAfterPruning — both WALs truncated, only tail survives
    • TestStateRecoverySkipsStaleBlocks — blocks before first QC range ignored
    • TestStateRecoveryBlocksBehindQCs — QCs ahead of blocks, gap re-fetched
    • TestStateRecoveryIgnoresBlocksBeyondQC — blocks beyond QC range ignored
    • TestReconcileTruncatesBlocksTail — stale blocks past QCs trimmed on startup
    • TestRecoveryWithPartialQCPrefix — partial QC prefix from per-block pruning; blocks as golden
    • TestPruningKeepsLastQCRange — pruning never empties state; restart recovers
    • TestPruningWithPartialQCRange — per-block pruning splits QC range; recovery handles it
    • TestRunPruningEmptyState — no panic on first startup with no data
    • TestStateRejectsBlockGapInWAL — corrupt WAL with block gap detected
    • TestExecution — async persistence + PushAppHash wait semantics
  • All existing tests pass (data, avail, consensus, p2p/giga)
  • gofmt and go vet clean

🤖 Generated with Claude Code

@github-actions

github-actions bot commented Mar 27, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Apr 7, 2026, 6:25 PM |

@codecov

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 73.67206% with 114 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.78%. Comparing base (2b25de6) to head (383ecaf).
⚠️ Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-tendermint/internal/autobahn/data/state.go | 72.34% | 29 Missing and 23 partials ⚠️ |
| ...nternal/autobahn/consensus/persist/globalblocks.go | 77.77% | 18 Missing and 8 partials ⚠️ |
| ...ternal/autobahn/consensus/persist/fullcommitqcs.go | 78.75% | 12 Missing and 5 partials ⚠️ |
| ...dermint/internal/autobahn/consensus/persist/wal.go | 70.58% | 5 Missing and 5 partials ⚠️ |
| sei-tendermint/internal/p2p/giga_router.go | 20.00% | 4 Missing and 4 partials ⚠️ |
| sei-tendermint/internal/autobahn/data/testonly.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main    #3126      +/-   ##
==========================================
+ Coverage   58.72%   58.78%   +0.05%
==========================================
  Files        2055     2058       +3
  Lines      168494   168876     +382
==========================================
+ Hits        98955    99275     +320
- Misses      60745    60775      +30
- Partials     8794     8826      +32
```
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 76.29% <73.67%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...nt/internal/autobahn/consensus/persist/testonly.go | 100.00% <100.00%> (ø) |
| sei-tendermint/internal/autobahn/data/testonly.go | 61.32% <0.00%> (ø) |
| sei-tendermint/internal/p2p/giga_router.go | 65.95% <20.00%> (-2.94%) ⬇️ |
| ...dermint/internal/autobahn/consensus/persist/wal.go | 68.18% <70.58%> (+4.24%) ⬆️ |
| ...ternal/autobahn/consensus/persist/fullcommitqcs.go | 78.75% <78.75%> (ø) |
| ...nternal/autobahn/consensus/persist/globalblocks.go | 77.77% <77.77%> (ø) |

... and 4 files with indirect coverage changes


@wen-coding wen-coding changed the title persist FullCommitQCs in data layer via WAL persist blocks and FullCommitQCs in data layer via WAL Mar 27, 2026
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from 73df12b to 2613700 Compare March 31, 2026 00:16
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from e24ad7d to 9c79a0c Compare March 31, 2026 21:34
@wen-coding wen-coding requested a review from pompon0 March 31, 2026 21:36
@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from bbfaa38 to 7b9f6b2 Compare April 2, 2026 20:41
```go
pruningTime := time.Now()
for inner, ctrl := range s.inner.Lock() {
	for inner.first < min(n, inner.nextAppProposal) {
		target := inner.findPruneBoundary(s.cfg.Committee, func(qcEnd types.GlobalBlockNumber) bool {
```
Contributor

nit: maybe it is just me, but I find it hard to read.
`pruneBefore := i.qsc[min(n, nextAppProposal)-1].First` with appropriate overflow checks should do, right?

Contributor

actually the problem for me is that "boundary" name is vague, and so is "end" (is it inclusive/exclusive)

Contributor

I think, it would be easier to move the pruning boundary adjustment to DataWAL - it can round down the pruning to the boundary.

Contributor

OR alternatively, we could prune as previously, but when loading we load qcs only from the first block forward (i.e. we do not load the qc for the block numbers before the first block, even if the first block in storage was not the first block of the qc global range). This way we do not create a gap at the beginning.

Contributor Author

Simplified a bit, does this look better?

Contributor

it does, definitely more readable, thanks. I still have some simplifications in mind, but I'll just experiment with this later myself.

Contributor Author

Sure, I thought about your proposals as well:

  • moving the boundary adjustment to DataWAL: this would be similar logic to what we have now, just in a different location. I'm a bit hesitant to move it to DataWAL because if we try to hide persistence internals, the block WAL would need to be aware of the qc WAL. Not sure how big a concern that is, given we are moving to a new storage solution. It is also a bit weird that we refuse to serve block X but suddenly after restart we can, but that's a cosmetic complaint.

  • prune normally but fix the loading logic: I just find it a bit weird that at the tail, QCs can arrive before blocks, while at the head, we use blocks to define where the usable data starts. But I can live with this asymmetry.

Changed to prune normally but fix the loading logic, does this look simpler?

Contributor

> I'm a bit hesitant to move it to DataWAL because if we try to hide persistence internals, the block WAL would need to be aware of the qc WAL

DataWAL wraps both the block WAL and the qc WAL; I imagine all the inconsistencies can be resolved within this common wrapper, not in either of the internal WALs, no?
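For illustration, the reconciliation such a wrapper performs could be reduced to span arithmetic over the two inner WALs. `walSpan` and `reconcile` below are assumed names for a sketch, not the actual implementation:

```go
package main

import "fmt"

// walSpan models just the [first, next) range an inner WAL currently holds.
type walSpan struct{ first, next uint64 }

// reconcile aligns the two spans: prefixes are advanced to the larger
// first (a crash between parallel truncations can leave them unequal),
// and the blocks tail is trimmed to the QC tail, since blocks may be
// persisted before their covering QC (the TruncateAfter case).
func reconcile(blocks, qcs walSpan) (walSpan, walSpan) {
	first := max(blocks.first, qcs.first)
	blocks.first, qcs.first = first, first
	if blocks.next > qcs.next {
		blocks.next = qcs.next // trim stale blocks past the last QC
	}
	return blocks, qcs
}

func main() {
	// Blocks pruned less than QCs at the head, persisted further at the tail.
	b, q := reconcile(walSpan{10, 25}, walSpan{12, 20})
	fmt.Println(b, q) // {12 20} {12 20}
}
```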

Contributor

the pruning simplification looks neat, at least to me, thanks!

@wen-coding wen-coding force-pushed the wen/persist_data branch 2 times, most recently from af94d9a to 8e235c6 Compare April 3, 2026 18:15
```go
// If WAL data starts past committee.FirstBlock() (due to pruning in a
// previous run), fast-forward all cursors to where data actually starts.
qcFirst := dataWAL.CommitQCs.LoadedFirst()
if qcFirst > cfg.Committee.FirstBlock() {
```
Contributor

@pompon0 pompon0 Apr 6, 2026

there is no need for skipTo to be conditional, right?

Contributor Author

How do we plan to use the first block argument?

Could it be that we purged to block 100, but somehow we decided to restart everyone at block 105?

Contributor

First block is constant per epoch. I have not planned for using it for coordinated hard forks yet.

Contributor

oh, I thought that skipTo is a no-op if `qcFirst <= cfg.Committee.FirstBlock()`, but it is not.
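One way the guard becomes unnecessary is if skipTo clamps internally; a minimal sketch, assuming a hypothetical `inner` type and `first` cursor field:

```go
package main

import "fmt"

// inner mimics just the recovery cursor; names are illustrative.
type inner struct{ first uint64 }

// skipTo fast-forwards the cursor and is a no-op when the target is not
// ahead, so callers can invoke it unconditionally without the
// `if qcFirst > cfg.Committee.FirstBlock()` check.
func (i *inner) skipTo(n uint64) {
	if n > i.first {
		i.first = n
	}
}

func main() {
	i := &inner{first: 100}
	i.skipTo(90) // behind the cursor: nothing happens
	i.skipTo(105)
	fmt.Println(i.first) // 105
}
```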

```go
	return fmt.Errorf("qc.Verify(): %w", err)
}
gr := qc.QC().GlobalRange(committee)
if gr.First != i.nextQC {
```
Contributor

DataWAL still doesn't normalize loaded data, so this will fail if loaded blocks do not match loaded QCs, right?

Contributor Author

Can you clarify a little bit what you mean by "do not match"?

  • The DataWAL construction dumbly loads everything on disk; normalization happens inside reconcile()
  • QC is the ultimate truth, so we verify and load QCs first (which is what we are doing here)
  • Then, once QCs are in place, we verify and load blocks; blocks outside the QC range should be skipped
  • Having QCs without matching blocks happens in production, so it's expected

Hmm, do you mean we don't check for block contiguity? The current persister guarantees block contiguity on writing, but I guess we can add a defense-in-depth check here; changed.
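The defense-in-depth contiguity check can be sketched as a simple scan over the loaded block numbers, treating WAL contents as untrusted. `checkContiguous` is an assumed name for illustration, not the real code:

```go
package main

import "fmt"

// checkContiguous rejects any gap in loaded block numbers. The persister
// guarantees contiguity on write, so this only fires on a corrupt WAL.
func checkContiguous(first uint64, nums []uint64) error {
	want := first
	for _, n := range nums {
		if n != want {
			return fmt.Errorf("block gap in WAL: got %d, want %d", n, want)
		}
		want++
	}
	return nil
}

func main() {
	fmt.Println(checkContiguous(7, []uint64{7, 8, 9})) // <nil>
	fmt.Println(checkContiguous(7, []uint64{7, 9}))    // block gap in WAL: got 9, want 8
}
```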

Contributor

What I mean is that loading data happens AFTER prefix reconciliation, so it is possible that gr.First < i.nextQC here, in case more blocks than QCs were pruned before a crash.

Contributor

ok, I see now that NewState is dropping the non-reconciled part of the loaded state (I might have missed that in the previous review, sorry).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```go
if gr.Next <= inner.first {
	continue // fully before first, skip
}
if gr.First < inner.first {
```
Contributor

this case can be merged into insertQC afaict

Contributor Author

That's a good point, done.

@pompon0 pompon0 self-requested a review April 7, 2026 14:16
insertQC now accepts QCs whose range starts before nextQC (partially
pruned prefix silently skipped). This removes duplicated QC insertion
logic from NewState's recovery loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
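The behavior this commit describes can be sketched as follows; `state`, `qcRange`, and `insertQC` are simplified stand-ins for the real types, shown only to illustrate the prefix-tolerance rule:

```go
package main

import "fmt"

type state struct{ nextQC uint64 }

// qcRange models a QC's global block range [first, next).
type qcRange struct{ first, next uint64 }

// insertQC accepts a range starting before nextQC: the already-pruned
// prefix is silently skipped instead of being treated as a gap error.
// A range starting past nextQC is still a gap and rejected.
func (s *state) insertQC(r qcRange) error {
	if r.next <= s.nextQC {
		return nil // entirely before the cursor: nothing to do
	}
	if r.first > s.nextQC {
		return fmt.Errorf("qc gap: range starts at %d, want %d", r.first, s.nextQC)
	}
	s.nextQC = r.next
	return nil
}

func main() {
	s := &state{nextQC: 10}
	fmt.Println(s.insertQC(qcRange{8, 15}), s.nextQC) // prefix 8..10 skipped
	fmt.Println(s.insertQC(qcRange{20, 25}))          // gap: rejected
}
```

This is what lets the recovery loop in NewState feed every loaded QC straight into insertQC without a special case for partially pruned ranges.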
@wen-coding wen-coding changed the title persist blocks and FullCommitQCs in data layer via WAL persist blocks and FullCommitQCs in data layer via WAL (CON-231) Apr 8, 2026
@wen-coding wen-coding added this pull request to the merge queue Apr 8, 2026
Merged via the queue into main with commit 289bc61 Apr 8, 2026
40 checks passed
@wen-coding wen-coding deleted the wen/persist_data branch April 8, 2026 21:03
yzang2019 added a commit that referenced this pull request Apr 9, 2026
* main:
  Add receipt / log reads to cryptosim (#3081)
  persist blocks and FullCommitQCs in data layer via WAL (CON-231) (#3126)
  Update Changelog in prep to cut v6.4.1 (#3213)
  fix(sei-tendermint): resolve staticcheck warnings (#3207)
  Add historical state offload stream hook (#3183)
  feat: wire autobahn config propagation from top-level to GigaRouter (CON-232) (#3194)