Simplifying dimension merging #2094
Conversation
Force-pushed from 80a3377 to 72d4987.
Reduced dim conversion time from 11.6 sec to 7.1 sec, with 12 indexes of 500K rows each.
What about the performance difference when using IntBuffer instead of int[]?
I don't think it will affect performance (I'll check tomorrow). Is IntBuffer better than int[]? It's less intuitive to me.
I just don't know why IntBuffer is used; maybe someone else can explain why.
Previously the IntBuffer was allocated off-heap; with this change it will be on-heap. Whether this is better or worse can be debated. Currently we rely on garbage collection to hopefully clean up our buffers before we run out of memory, but allocating on-heap may create a lot of heap pressure if those arrays are big and long-lived. It might be worth some benchmarks to get a sense of how things will be affected.
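For reference, the two allocation modes under discussion can be sketched as follows (an illustrative sketch only; the class name and buffer size are made up and not from the PR):

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class BufferModes {
    public static void main(String[] args) {
        // On-heap: backed by an int[] inside the Java heap, collected by
        // the GC like any other object, so it adds heap pressure.
        IntBuffer onHeap = IntBuffer.allocate(1024);

        // Off-heap (direct): memory lives outside the Java heap and is
        // only released once the owning direct ByteBuffer is collected.
        IntBuffer offHeap = ByteBuffer.allocateDirect(1024 * Integer.BYTES).asIntBuffer();

        onHeap.put(0, 42);
        offHeap.put(0, 42);
        System.out.println(onHeap.get(0) + " " + offHeap.get(0)); // prints "42 42"
    }
}
```

Both expose the same IntBuffer API; the trade-off is where the memory lives and how it is reclaimed, which is the heap-pressure concern raised above.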
With a direct IntBuffer it took 8 seconds, so no notable difference in performance. I prefer int[], but I'm ok with a direct buffer. Opinions?
Heap pressure is already really bad during the making of index files, and making sure we know how this change impacts heap pressure during that time is important. During the merge and persist phase of realtime tasks, we already have very high CPU usage, enough that you have to be aware of how it impacts query performance. Adding more heap pressure during that phase should be done with great care.
There are also issues with heap size during the reduce portion of hadoop tasks (or spark batch tasks). So I'm curious whether adding more objects (int[]) affects the limit on the number of rows per segment (or whether it impacts high-cardinality dimensions).
Reverted to direct IntBuffer.
👍
Force-pushed from 98e5a91 to 82bab15.
Can we have a more descriptive name than counter?
Renamed to numMergeIndex.
Force-pushed from 82bab15 to a278fe5.
👍 This looks good to me, but I think someone else familiar with this code needs to review it.
+1
Can you squash the commits?
Force-pushed from a278fe5 to 35bc224.
@binlijin squashed, thanks.
@fjy ok
182a0e1 to
c1b0f06
Compare
@binlijin It's becoming more and more painful to rebase. Do you really have time to look into this?
@navis Yes, there are multiple teams working on the same code, which is why we need proposals to coordinate, and all the changes are important.
Can these be static? Also, can we define them at the top of the file?
Removed the timer; it was just used to check performance.
Why not use Guava's Stopwatch?
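An ad-hoc debug timer like the one being removed typically looks like the sketch below (illustrative only; the class name and the 50 ms sleep are made up). Guava's Stopwatch wraps the same System.nanoTime idea with nicer unit handling and start/stop/reset semantics, which is presumably why it was suggested:

```java
public class TimerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hand-rolled timing around a block of work, using the monotonic clock.
        long start = System.nanoTime();
        Thread.sleep(50); // stand-in for the work being measured
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Sanity check: sleep(50) should take at least roughly 50 ms.
        System.out.println(elapsedMs >= 45 ? "ok" : "short");
    }
}
```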
Force-pushed from c1b0f06 to 05fc7dc.
Rebased on master, barely. I'll address the comments.
Can you rename "WithConversion", e.g. to ConvertingIndexSeeker or IndexSeekerWithConversion? Currently it sounds a bit like a boolean parameter, and it's not immediately clear that it's a seeker.
👍 Looks good to me after the Seeker renaming comments are addressed.
Force-pushed from d55fc5e to d74e526.
@himanshug @xvrl do you want to take a look? I think this is getting close to ready, and I will merge unless there are more comments.
Assuming the above code is moved as-is into the makeRowIterable(..) method?
Yes, I made it a method so it can be used in V9Merger.
👍 Please squash.
d74e526 to
dd774ef
Compare
Squashed.
Simplifying dimension merging
Currently, dimension merging is processed in two stages: one for the dictionary, one for the index. If this could be done in a single stage, total processing time could be decreased.
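The single-stage idea can be sketched as a k-way merge over the sorted per-index dictionaries that emits the merged dictionary and the per-index old-id-to-new-id conversion tables in one pass. This is a minimal sketch under that assumption; the class and method names are illustrative and not Druid's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictMergeSketch {
    // Merge sorted per-index dictionaries, producing the merged dictionary
    // and, for each input index, an old-id -> new-id conversion array.
    static List<String> merge(List<String[]> dicts, List<int[]> conversionsOut) {
        int n = dicts.size();
        int[] pos = new int[n];
        List<String> merged = new ArrayList<>();
        for (String[] d : dicts) {
            conversionsOut.add(new int[d.length]);
        }
        while (true) {
            // Find the smallest value currently at the head of any dictionary.
            String min = null;
            for (int i = 0; i < n; i++) {
                if (pos[i] < dicts.get(i).length) {
                    String v = dicts.get(i)[pos[i]];
                    if (min == null || v.compareTo(min) < 0) {
                        min = v;
                    }
                }
            }
            if (min == null) {
                break; // all dictionaries exhausted
            }
            int newId = merged.size();
            merged.add(min);
            // Advance every dictionary holding this value, recording its new id.
            for (int i = 0; i < n; i++) {
                if (pos[i] < dicts.get(i).length && dicts.get(i)[pos[i]].equals(min)) {
                    conversionsOut.get(i)[pos[i]] = newId;
                    pos[i]++;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> conv = new ArrayList<>();
        List<String> merged = merge(
            Arrays.asList(new String[]{"a", "c"}, new String[]{"b", "c"}), conv);
        System.out.println(merged);                        // [a, b, c]
        System.out.println(Arrays.toString(conv.get(0)));  // [0, 2]
        System.out.println(Arrays.toString(conv.get(1)));  // [1, 2]
    }
}
```

The point of the single pass is that the conversion tables fall out of the same traversal that builds the merged dictionary, instead of requiring a separate index-walking stage afterwards.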