
flatkv cache #3027

Merged
cody-littley merged 102 commits into main from cjl/flatkv-cache
Apr 1, 2026

Conversation

@cody-littley (Contributor) commented Mar 5, 2026

Describe your changes and provide context

Add a caching layer to FlatKV, more than doubling performance in cryptosim benchmarks.

Testing performed to validate your change

Unit tests; ran the benchmark over several days.

@github-actions bot commented Mar 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  Apr 1, 2026, 6:27 PM

@codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 75.52301% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.61%. Comparing base (3ed3bf2) to head (872b0a6).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
sei-db/state_db/sc/flatkv/config.go 52.70% 19 Missing and 16 partials ⚠️
sei-db/db_engine/pebbledb/db.go 50.00% 18 Missing and 5 partials ⚠️
sei-db/state_db/sc/flatkv/store_write.go 83.68% 12 Missing and 11 partials ⚠️
sei-db/state_db/sc/flatkv/store.go 75.38% 8 Missing and 8 partials ⚠️
sei-db/common/threading/fixed_pool.go 77.77% 2 Missing and 2 partials ⚠️
sei-db/db_engine/dbcache/cache_config.go 63.63% 2 Missing and 2 partials ⚠️
sei-db/db_engine/pebbledb/pebbledb_config.go 60.00% 2 Missing and 2 partials ⚠️
sei-db/common/threading/adhoc_pool.go 83.33% 1 Missing and 1 partial ⚠️
sei-db/common/threading/elastic_pool.go 90.00% 1 Missing and 1 partial ⚠️
sei-db/db_engine/pebbledb/batch.go 60.00% 1 Missing and 1 partial ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3027      +/-   ##
==========================================
- Coverage   58.75%   58.61%   -0.15%     
==========================================
  Files        2095     2100       +5     
  Lines      173551   175039    +1488     
==========================================
+ Hits       101965   102594     +629     
- Misses      62465    63273     +808     
- Partials     9121     9172      +51     
Flag Coverage Δ
sei-chain-pr 73.07% <75.52%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-db/common/metrics/phase_timer.go 85.10% <100.00%> (+0.32%) ⬆️
sei-db/config/sc_config.go 100.00% <100.00%> (ø)
sei-db/db_engine/dbcache/cache.go 77.77% <100.00%> (+66.66%) ⬆️
sei-db/db_engine/dbcache/cache_impl.go 95.45% <100.00%> (-0.20%) ⬇️
sei-db/db_engine/dbcache/shard.go 91.81% <100.00%> (+1.63%) ⬆️
sei-db/db_engine/pebbledb/pebble_metrics.go 69.09% <100.00%> (ø)
sei-db/db_engine/pebbledb/pebbledb_test_config.go 100.00% <100.00%> (ø)
sei-db/db_engine/types/types.go 100.00% <ø> (ø)
sei-db/state_db/sc/flatkv/flatkv_test_config.go 100.00% <100.00%> (ø)
sei-db/state_db/sc/flatkv/snapshot.go 65.98% <100.00%> (+0.19%) ⬆️
... and 12 more

... and 41 files with indirect coverage changes


return fmt.Errorf("metadata db config is invalid: %w", c.MetadataDBConfig.Validate())
}

if c.ReaderThreadsPerCore < 0 {
Contributor

If ReaderThreadsPerCore == 0 and ReaderConstantThreadCount == 0, it will create a pool with 0 workers, causing a deadlock?

Contributor Author

Good point. I've updated the pool constructor to more elegantly handle this case:

	if workers <= 0 {
		workers = 1
	}

if c.DataDir == "" {
return fmt.Errorf("data dir is required")
}
if c.CacheSize > 0 && (c.CacheShardCount&(c.CacheShardCount-1)) != 0 {
Contributor

Shall we also validate and make sure CacheShardCount > 0?

Contributor Author

Good point. Validation added.
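A sketch of what the combined check might look like (field names and error wording are illustrative, not the PR's exact code). The power-of-two requirement exists because it lets the cache map a key hash to a shard with a cheap bitmask, `hash & (shardCount-1)`, instead of a modulo:

```go
package main

import "fmt"

// validateShards checks that shardCount is a positive power of two
// whenever the cache is enabled (cacheSize > 0).
func validateShards(cacheSize, shardCount int) error {
	if cacheSize <= 0 {
		return nil // cache disabled; shard count is irrelevant
	}
	if shardCount <= 0 {
		return fmt.Errorf("cache shard count must be positive, got %d", shardCount)
	}
	if shardCount&(shardCount-1) != 0 {
		return fmt.Errorf("cache shard count must be a power of two, got %d", shardCount)
	}
	return nil
}

func main() {
	fmt.Println(validateShards(1024, 16)) // valid
	fmt.Println(validateShards(1024, 0))  // rejected: not positive
	fmt.Println(validateShards(1024, 12)) // rejected: not a power of two
}
```

Note that without the `> 0` check, a shard count of 0 would pass the bitmask test (`0 & -1 == 0`), which is exactly the gap the reviewer flagged.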

@yzang2019
Contributor

One blocker comment: I think we should decouple the cache from db_engine as much as possible, which would allow FlatKV to switch to a different db_engine easily without reimplementing the cache for each engine in the future.

defer wg.Done()
errs[idx] = b.Commit(syncOpt)
}(i, p.batch)
err := s.miscPool.Submit(s.ctx, func() {
Contributor

This could be a deadlock issue under load.

In commitBatches(), tasks are submitted to miscPool. Each task calls cachedBatch.Commit(), which calls cache.BatchSet(), which also submits to miscPool. If the pool queue is saturated (all workers executing batch commits), the inner BatchSet submissions will block waiting for a free slot — but slots can't free up until workers finish, and workers can't finish until BatchSet completes.

Contributor Author

This is a good observation, but I've already taken precautions against this exact scenario. ;)

There are several pool implementations:

  • fixed: N worker threads, work only runs on these threads
  • elastic: N worker threads, if worker thread is not immediately available then spin up new goroutine
  • adhoc: each job gets its own goroutine (for unit tests, mostly)

The misc pool is using an elastic pool. This means that it's safe to send tasks with blocking dependencies, without fear of deadlock.

miscPool := threading.NewElasticPool(ctx, "flatkv-misc", miscPoolSize)

In the elastic pool's implementation, the work queue is a channel of size 0, meaning that if there isn't a worker currently sitting idle, we will immediately fall through to the default case in the select statement.

func (ep *elasticPool) Submit(ctx context.Context, task func()) error {
	if task == nil {
		return fmt.Errorf("elastic pool: nil task")
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-ep.ctx.Done():
		return fmt.Errorf("elastic pool is shut down")
	case ep.workQueue <- task:
		return nil
	default:
		// All warm workers are busy; spawn a temporary goroutine.
		go task()
		return nil
	}
}

}(i, db)
err := s.miscPool.Submit(s.ctx, func() {
errs[i] = db.Flush()
wg.Done()
Contributor

We should add defer for wg.Done(); otherwise a panic in the task skips Done() and wg.Wait() hangs forever.

Contributor Author
@cody-littley Mar 31, 2026

Change made.

return fmt.Errorf("invalid address length %d for key kind %d", len(keyBytes), kind)
}
addrStr := string(addr[:])
addrKey := string(AccountKey(addr))
Contributor

If anyone ever changes AccountKey to add a prefix or any other transformation (a natural evolution for a "DB key builder" function), batchReadOldValues would silently fail to find pending writes in s.accountWrites.

Suggest picking one canonical key representation and using it everywhere: either always string(addr[:]) or always string(AccountKey(addr)).

Contributor Author

Fixed.
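The invariant being suggested can be sketched like this (the `Address` type and map usage are illustrative; per the later comment, AccountKey today returns `addr[:]` unchanged):

```go
package main

import "fmt"

// Address stands in for the store's 20-byte account address type.
type Address [20]byte

// AccountKey builds the canonical DB key for an account address. Deriving
// every map key through this one function means a future change (e.g.
// adding a prefix byte) propagates to all lookups at once, instead of
// silently diverging from call sites that used string(addr[:]) directly.
func AccountKey(addr Address) []byte {
	return addr[:]
}

func main() {
	var addr Address
	addr[0] = 0xab

	pending := map[string][]byte{}
	key := string(AccountKey(addr)) // single canonical derivation
	pending[key] = []byte{1, 2, 3}

	// Any later lookup re-derives the key the same way, so writes and
	// reads can never disagree about the representation.
	v, ok := pending[string(AccountKey(addr))]
	fmt.Println(ok, v)
}
```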

if !ok {
continue
}
k := string(AccountKey(addr))
Contributor

Consider using string(addr[:]) for the s.accountWrites lookup (matching ApplyChangeSets and store_read.go), and a separate addrKey := string(AccountKey(addr)) for the accountOld/accountBatch maps. Today AccountKey returns addr[:], so the result is the same, but using two different derivations for the same map key is fragile if AccountKey ever changes.

Contributor Author

Change made.

return fmt.Errorf("metadata db config is invalid: %w", err)
}

if c.ReaderThreadsPerCore < 0 {
Contributor

nit: <= 0 to match the error text

Contributor Author

Fixed.

defer wg.Done()
storageErr = s.storageDB.BatchGet(storageBatch)
})
if err != nil {
Contributor

If a later Submit fails and we return early, previously submitted tasks may still be running and writing to their batch maps. How about calling wg.Wait() before returning on a submit error to avoid a race?

Contributor Author

Good point, although I think we can fix this with a slightly simpler solution.

Submit can only fail when contexts get cancelled, i.e. we should only expect to encounter this sort of error during system teardown workflows. So I don't think it's important to optimize performance here, just to make sure it's functionally correct.

The problem is that when we return, we're returning garbage map data. And even worse, we're returning garbage data that isn't threadsafe.

I don't think we need to block until all goroutines are finished. I think the important part is to just always return nil values if we are returning an error. It's ok if the goroutine doesn't immediately stop, as long as the caller isn't receiving unsafe/invalid data.

I've converted error return cases to use the following form:

		if err != nil {
			return nil, nil, nil, nil, fmt.Errorf("failed to submit batch get: %w", err)
		}

Contributor

Agreed on the batchReadOldValues fix.

In commitBatches() there is a similar issue, and I'd still lean toward switching back to plain goroutines, since there's no returned map state to null out there. The main issue in that path is just the partial-submit/unwind interaction, and direct goroutines seem like the simplest way to avoid it.

Contributor Author

As we discussed offline, I've changed the function signature so that Submit() never returns an error.

return fmt.Errorf("LogDir is empty, refusing to proceed")
}

if cfg.DeleteDataDirOnStartup {
Contributor

In cmd/cryptosim, DeleteDataDirOnStartup deletes DataDir, while in cmd/configure-logger it deletes LogDir, so enabling this flag now has inconsistent behavior between the two commands.

Contributor Author

Blame @masih for this one, configuring the logger at init time makes this sort of workflow wonky. 😜

I've added separate configurations for deleting log dirs, so now it should be easier to grok how the settings control the workflow.

}

// Validate checks that the configuration is sane and returns an error if it is not.
func (c *CacheConfig) Validate() error {
Contributor

Is this function being used now?

Contributor Author

Good catch, they weren't being called! I've added calls to these methods inside flatKV config.

	if err := c.AccountCacheConfig.Validate(); err != nil {
		return fmt.Errorf("account cache config is invalid: %w", err)
	}
	if err := c.CodeCacheConfig.Validate(); err != nil {
		return fmt.Errorf("code cache config is invalid: %w", err)
	}
	if err := c.StorageCacheConfig.Validate(); err != nil {
		return fmt.Errorf("storage cache config is invalid: %w", err)
	}
	if err := c.LegacyCacheConfig.Validate(); err != nil {
		return fmt.Errorf("legacy cache config is invalid: %w", err)
	}
	if err := c.MetadataCacheConfig.Validate(); err != nil {
		return fmt.Errorf("metadata cache config is invalid: %w", err)
	}

@cody-littley cody-littley added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 232fee5 Apr 1, 2026
39 checks passed
@cody-littley cody-littley deleted the cjl/flatkv-cache branch April 1, 2026 18:58
yzang2019 added a commit that referenced this pull request Apr 1, 2026
* main:
  plt-228 fixed static check on app and evmrpc package (#3154)
  flatkv cache (#3027)
  Make cryptosim state store backend configurable + No Op Wrapper + Read Disable Config (#3145)
  Add warning message for IAVL deprecation (#3159)
  Change default min valid per window to zero (#3157)
  support for starting autobahn from non-zero global block (#3136)
  Fix upgrade list comparison to respect semver (#3153)