Conversation
Codecov Report

@@           Coverage Diff           @@
##             main    #3027      +/-   ##
==========================================
- Coverage   58.75%   58.61%   -0.15%
==========================================
  Files        2095     2100       +5
  Lines      173551   175039    +1488
==========================================
+ Hits       101965   102594     +629
- Misses      62465    63273     +808
- Partials     9121     9172      +51
sei-db/state_db/sc/flatkv/config.go
Outdated
return fmt.Errorf("metadata db config is invalid: %w", c.MetadataDBConfig.Validate())
}

if c.ReaderThreadsPerCore < 0 {
If ReaderThreadsPerCore == 0 and ReaderConstantThreadCount == 0, this will create a pool with 0 workers, causing a deadlock?
Good point. I've updated the pool constructor to more elegantly handle this case:
if workers <= 0 {
workers = 1
}
if c.DataDir == "" {
return fmt.Errorf("data dir is required")
}
if c.CacheSize > 0 && (c.CacheShardCount&(c.CacheShardCount-1)) != 0 {
Shall we also validate and make sure CacheShardCount > 0?
Good point. Validation added.
One blocker comment: I think we should decouple the cache from db_engine as much as possible, which would allow FlatKV to switch to a different db_engine easily without reimplementing the cache for each engine in the future.
defer wg.Done()
errs[idx] = b.Commit(syncOpt)
}(i, p.batch)
err := s.miscPool.Submit(s.ctx, func() {
This could cause a deadlock under load.
In commitBatches(), tasks are submitted to miscPool. Each task calls cachedBatch.Commit(), which calls cache.BatchSet(), which also submits to miscPool. If the pool queue is saturated (all workers executing batch commits), the inner BatchSet submissions will block waiting for a free slot — but slots can't free up until workers finish, and workers can't finish until BatchSet completes.
This is a good observation, but I've already taken precautions against this exact scenario. ;)
There are several pool implementations:
- fixed: N worker threads, work only runs on these threads
- elastic: N worker threads, if worker thread is not immediately available then spin up new goroutine
- adhoc: each job gets its own goroutine (for unit tests, mostly)
The misc pool is using an elastic pool. This means that it's safe to send tasks with blocking dependencies, without fear of deadlock.
miscPool := threading.NewElasticPool(ctx, "flatkv-misc", miscPoolSize)
In the elastic pool's implementation, the work queue is a channel of size 0, meaning that if there isn't a worker currently sitting idle, we will immediately fall through to the default case in the select statement.
func (ep *elasticPool) Submit(ctx context.Context, task func()) error {
if task == nil {
return fmt.Errorf("elastic pool: nil task")
}
select {
case <-ctx.Done():
return ctx.Err()
case <-ep.ctx.Done():
return fmt.Errorf("elastic pool is shut down")
case ep.workQueue <- task:
return nil
default:
// All warm workers are busy; spawn a temporary goroutine.
go task()
return nil
}
}
}(i, db)
err := s.miscPool.Submit(s.ctx, func() {
errs[i] = db.Flush()
wg.Done()
We should use defer for wg.Done(), so a panic in the task doesn't leave wg.Wait() hanging forever.
Change made.
return fmt.Errorf("invalid address length %d for key kind %d", len(keyBytes), kind)
}
addrStr := string(addr[:])
addrKey := string(AccountKey(addr))
If anyone ever changes AccountKey to add a prefix or any other transformation (a natural evolution for a "DB key builder" function), batchReadOldValues would silently fail to find pending writes in s.accountWrites.
Suggest picking one canonical key representation and using it everywhere: either always string(addr[:]) or always string(AccountKey(addr)).
if !ok {
continue
}
k := string(AccountKey(addr))
Consider using string(addr[:]) for the s.accountWrites lookup (matching ApplyChangeSets and store_read.go), and a separate addrKey := string(AccountKey(addr)) for the accountOld/accountBatch maps. Today AccountKey returns addr[:] so the results are the same, but using two different derivations for the same map key is fragile if AccountKey ever changes.
sei-db/state_db/sc/flatkv/config.go
Outdated
return fmt.Errorf("metadata db config is invalid: %w", err)
}

if c.ReaderThreadsPerCore < 0 {
nit: use <= 0 to match the error text
defer wg.Done()
storageErr = s.storageDB.BatchGet(storageBatch)
})
if err != nil {
If a later Submit fails and we return early, previously submitted tasks may still be running and writing to their batch maps. How about calling wg.Wait() before returning on a submit error, to avoid the race?
Good point, although I think we can fix this with a slightly simpler solution.
Submit can only fail when contexts get cancelled, i.e. we should only expect to encounter this sort of error during system teardown workflows. So I don't think it's important to optimize performance here, just to make sure it's functionally correct.
The problem is that when we return, we're returning garbage map data. And even worse, we're returning garbage data that isn't threadsafe.
I don't think we need to block until all goroutines are finished. I think the important part is to just always return nil values if we are returning an error. It's ok if the goroutine doesn't immediately stop, as long as the caller isn't receiving unsafe/invalid data.
I've converted error return cases to use the following form:
if err != nil {
return nil, nil, nil, nil, fmt.Errorf("failed to submit batch get: %w", err)
}
Agreed on the batchReadOldValues fix.
In commitBatches() there is a similar issue, and I'd still lean toward switching back to plain goroutines, since there's no returned map state to null out there. The main issue in that path is just the partial-submit/unwind interaction, and direct goroutines seem like the simplest way to avoid that.
As we discussed offline, I've changed the function signature so that Submit() never returns an error.
return fmt.Errorf("LogDir is empty, refusing to proceed")
}

if cfg.DeleteDataDirOnStartup {
In cmd/cryptosim DeleteDataDirOnStartup deletes DataDir, while in cmd/configure-logger it deletes LogDir, so enabling this flag now has inconsistent behavior across commands.
Blame @masih for this one, configuring the logger at init time makes this sort of workflow wonky. 😜
I've added separate configurations for deleting log dirs, so now it should be easier to grok how the settings control the workflow.
}

// Validate checks that the configuration is sane and returns an error if it is not.
func (c *CacheConfig) Validate() error {
Is this function being used now?
Good catch, they weren't being called! I've added calls to these methods inside flatKV config.
if err := c.AccountCacheConfig.Validate(); err != nil {
return fmt.Errorf("account cache config is invalid: %w", err)
}
if err := c.CodeCacheConfig.Validate(); err != nil {
return fmt.Errorf("code cache config is invalid: %w", err)
}
if err := c.StorageCacheConfig.Validate(); err != nil {
return fmt.Errorf("storage cache config is invalid: %w", err)
}
if err := c.LegacyCacheConfig.Validate(); err != nil {
return fmt.Errorf("legacy cache config is invalid: %w", err)
}
if err := c.MetadataCacheConfig.Validate(); err != nil {
return fmt.Errorf("metadata cache config is invalid: %w", err)
}
* main:
  - plt-228 fixed static check on app and evmrpc package (#3154)
  - flatkv cache (#3027)
  - Make cryptosim state store backend configurable + No Op Wrapper + Read Disable Config (#3145)
  - Add warning message for IAVL deprecation (#3159)
  - Change default min valid per window to zero (#3157)
  - support for starting autobahn from non-zero global block (#3136)
  - Fix upgrade list comparison to respect semver (#3153)
Describe your changes and provide context
Add a caching layer to FlatKV, more than doubling performance in cryptosim benchmarks.
Testing performed to validate your change
Unit tests; ran benchmarks over several days.