perf: implement batch processing in iterateEvalTree#406
Conversation
|
@seqbenchbot up main search-keyword-exact-match-warm |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## 329-batching-1 #406 +/- ##
==================================================
- Coverage 71.54% 70.58% -0.97%
==================================================
Files 220 221 +1
Lines 16568 20423 +3855
==================================================
+ Hits 11854 14415 +2561
- Misses 3840 5128 +1288
- Partials 874 880 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@seqbenchbot down e8eefca9 |
|
Nice, @cheb0 The benchmark with identificator Show summary
Have a great time! |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| for _, lid := range lids { | ||
| rawLid := lid.Unpack() | ||
| blockIdx := p.table.GetIDBlockIndexByLID(rawLid) | ||
| if p.midCache.blockIndex != int(blockIdx) { |
There was a problem hiding this comment.
nit: fillMIDs has this check inside. did you add it to avoid function call?
| // Get MIDs | ||
| if needMids > 0 { | ||
| timerMID.Start() | ||
| mids = idsIndex.GetMIDs(lidsSlice[0:needMids], mids[:0]) |
There was a problem hiding this comment.
nit: technically we can omit the lower bound if it equals 0
lidsSlice[:needMids]| return seq.MID(p.midCache.GetValByLID(uint32(lid))), nil | ||
| } | ||
|
|
||
| func (p *Provider) MIDs(lids []node.LID, out []seq.MID) ([]seq.MID, error) { |
There was a problem hiding this comment.
Why Provider has method for retrieving a batch of MID but there is no similar method for RID?
| defer searchBuffersPool.Put(buffers) | ||
| mids := buffers.mids | ||
| rids := buffers.rids | ||
| lidsBuffer := buffers.lids |
There was a problem hiding this comment.
Shouldn't you reset buffers since slices are reused?
There was a problem hiding this comment.
looks like no, since I do not reassign it
| lidsBuf := lidsBuf{ | ||
| lids: make([]node.LID, 0, consts.LIDBlockCap), | ||
| } | ||
| return searchBuffers{ |
There was a problem hiding this comment.
It's better to return a pointer here, otherwise there will be unnecessary allocations since any is returned.
| filterMIDs := sw.Timer("filter_mids") | ||
| updateHist := sw.Timer("update_hist") |
There was a problem hiding this comment.
| filterMIDs := sw.Timer("filter_mids") | |
| updateHist := sw.Timer("update_hist") | |
| timerFilterMIDs := sw.Timer("filter_mids") | |
| timerUpdateHist := sw.Timer("update_hist") |
| @@ -47,12 +49,23 @@ type LIDsIter interface { | |||
There was a problem hiding this comment.
| LIDs(out []node.LID) []node.LID |
There was a problem hiding this comment.
I'll leave it here since it is out of scope of this diff.
Take a look at https://github.com/ozontech/seq-db/blob/329-batching-iterate-eval-tree/frac/sealed/lids/iterator_desc.go#L121-L131 -- I guess you've introduced code duplication while performing rebase.
| return total, ids, hist, aggs, nil | ||
| } | ||
|
|
||
| func filterOutOfRangeMIDs(params SearchParams, mids []seq.MID, lidsSlice []node.LID) ([]seq.MID, []node.LID) { |
There was a problem hiding this comment.
I am not sure what purpose this function serves.
Per my understanding, we cannot iterate over seq.LID which correspond to seq.ID that lie outside of user-requested range [from; to] -- this is guaranteed because we calculate minLID and maxLID in getLIDsBorders and use those in all iterators to set boundaries.
Am I missing something?
There was a problem hiding this comment.
This check is right from the original implementation, I have not added it: https://github.com/ozontech/seq-db/blob/main/frac/processor/search.go#L229.
The only thing it does is converting potential panic to error if minLID and maxLID deriving does not work somehow. And it's enabled only for histrograms. But it's more appropriate to panic in this case. Basically, it looks useless to me.
| buffers := searchBuffersPool.Get().(searchBuffers) | ||
| defer searchBuffersPool.Put(buffers) | ||
| mids := buffers.mids | ||
| rids := buffers.rids |
There was a problem hiding this comment.
Starting a petition to protect Vim users and their descendants — we require spaces. This is how we navigate code. Thank you for your cooperation.
Maybe something like?
var (
total int
lastID seq.ID
ids seq.IDSources
)
buffers := searchBuffersPool.Get().(searchBuffers)
defer searchBuffersPool.Put(buffers)| } | ||
| // limit how much we drain from eval tree for one-by-one flow. ignored for batched flow | ||
| need = min(need, maxLidsToDrain) | ||
| needLids = min(needLids, maxLidsToDrain) |
There was a problem hiding this comment.
Maybe we can move this whole thing with calculating limits/offsets/etc to the batch? I mean something like:
if ok {
evalTreeIter = func(need int, _ lidsBuf) LIDsIter {
// batched flow: juts get a batch and return
return batchNode.NextBatch().Trim(need)
// Or return batchNode.NextBatch(need)
}
} else {
...
}
func (b LIDBatch) Trim(k int) LIDBatch {
b.lids = b.lids[:min(k, len(b.lids))]
return b
}There was a problem hiding this comment.
I would prefer batchNode.NextBatch(need) but it needs more work to do.
Description
Continuation of #390
iterateEvalTreeworks with batches of lids, requests batches of mids and ridsget_midstepI did some measurements for both patches (this combined with #390) vs main (used bitpack encoding in both branches). For small ordinary searches there is no benefit. For dense analytic queries there is a decent improvement.
For our k6 benchmark
seq-db-hist.js:2.3 sec=>650 msFor
seq-db-aggs.js:6.1 sec=>4.7 secHist over
_all_(warm query) (3 prod fractions):~37 ms=>~15 msPart of #329