optimize create inverted indexes #2111

binlijin merged 1 commit into apache:master from binlijin:optimize-create-inverted-indexes
Conversation
Performance number 1:

Before:
2015-12-18 08:58:31,493 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 09:02:06,165 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed inverted.drd in 386,635 millis.

After:
2015-12-18 08:43:03,655 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 08:45:03,878 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed inverted.drd in 287,941 millis.

Performance number 2:

Before:
2015-12-18 09:45:12,948 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:46:15,038 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed inverted.drd in 118,692 millis.

After:
2015-12-18 09:28:52,253 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:29:33,492 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed inverted.drd in 96,796 millis.
@binlijin just looking at your merging times, have you thought about sharding your data more?

In any case, this is cool
minor formatting, need a space here

Actually there's a bunch of formatting stuff in this PR. Please make sure to use the style guide.

I use Eclipse and find the eclipse_formatting.xml is not useful; I will try IntelliJ.
@fjy, we have a big datasource and we need to build 30 billion records every day, so maybe we need bigger segments.

@binlijin You can create multiple segments for the same time interval with different shard numbers. I think you should try to keep segments around 5M rows. This is what we did for 100+ billion records per day.

@fjy, the big datasource needs to keep 15 days' worth of data, and we will query each day's data. What if we have too many segments? Can Druid handle that?

@fjy, what is the problem, and why does Travis fail?

@fjy, thanks.
👍 this looks good to me now, but I think someone else who knows this code should do a review as well

Related to #2138

@xvrl can you take a look?
can we update the comments to explain how BitmapIndexHolder works?

ping @xvrl
is there a reason to call nullToEmpty here? this seems like it might be an artifact of wrapping DimDim with NullValueConverterDimDim; however, getBitmapIndex relies on the actual values stored in DimDim, not the values returned by the wrapper. Is that correct?

Yes, you are right. NullValueConverterDimDim converts empty to null, so we need to convert it back to the actual value, because getBitmapIndex relies on the actual values stored in DimDim.

I think this warrants a comment, given that it took me a while to track down the reason for this.

Yes, done. I have already added a comment for this.
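To illustrate the point being discussed above: when a wrapper presents the stored empty string as null, a lookup keyed by the actual stored value has to convert null back to "" first. This is a minimal standalone sketch, not Druid's actual DimDim/NullValueConverterDimDim code; the class and helper names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the underlying store indexes by the actual value (""), while a
// wrapper hands callers null for empty strings. A lookup must round-trip
// null back to "" or it will miss the entry.
public class NullConversionSketch {
    // Stand-in for the bitmap-index lookup, keyed by the stored value.
    static final Map<String, Integer> bitmapIndexByValue = new HashMap<>();

    // Wrapper behavior: presents "" as null.
    static String emptyToNull(String v) {
        return (v != null && v.isEmpty()) ? null : v;
    }

    // The conversion discussed in the review: null -> "".
    static String nullToEmpty(String v) {
        return (v == null) ? "" : v;
    }

    public static void main(String[] args) {
        bitmapIndexByValue.put("", 0);    // empty string stored under its actual value
        bitmapIndexByValue.put("foo", 1);

        String wrapped = emptyToNull(""); // wrapper hands us null
        // Looking up with null would miss; convert back before the lookup.
        Integer idx = bitmapIndexByValue.get(nullToEmpty(wrapped));
        System.out.println(idx);          // prints 0
    }
}
```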
@xvrl rebase

rebase

@himanshug can you take a look to help finish this off?
I believe EmptyIndexedInts should be a singleton; it already has a static final instance, can you use that? Also make the no-arg constructor in that class private.
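The suggestion above is the standard singleton pattern: expose the one shared instance and make the constructor private so no further instances can be created. A simplified sketch (the class body here is illustrative, not Druid's actual EmptyIndexedInts, and the field name is an assumption):

```java
// Sketch of the reviewer's suggestion: a stateless "empty" value object
// needs only one instance, so expose it as a static final field and
// forbid construction from outside.
public class EmptyIndexedInts {
    public static final EmptyIndexedInts EMPTY_INDEXED_INTS = new EmptyIndexedInts();

    private EmptyIndexedInts() {}  // no-arg constructor is private

    public int size() {
        return 0;  // an empty sequence of ints
    }

    public static void main(String[] args) {
        // All callers share the same instance.
        System.out.println(EMPTY_INDEXED_INTS == EmptyIndexedInts.EMPTY_INDEXED_INTS);
        System.out.println(EMPTY_INDEXED_INTS.size());
    }
}
```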
@binlijin can you update the PR description with a summary of why this change improves performance; it will be helpful to anyone looking at the PR.
can you make both static variables private as well?

ok, I see that they are used in other places.

@himanshug do update the PR description.
rebase
you should add an Assert.fail(..) here, or else the verification doesn't happen when the exception is not thrown.

Good catch, I will fix it.
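The pattern the reviewer is pointing at: in an expected-exception test, the assertions live in the catch block, so if the method silently succeeds nothing is verified and the test passes. A fail() right after the call closes that hole. A minimal sketch, with plain Java standing in for JUnit's Assert (the method and message here are hypothetical):

```java
// Sketch: without the fail(...) after the call, a missing exception would
// go unnoticed, because the verification only runs on the catch path.
public class ExpectedExceptionSketch {
    static void methodUnderTest(boolean misbehave) {
        if (misbehave) {
            throw new IllegalArgumentException("bad input");
        }
    }

    public static void main(String[] args) {
        boolean verified = false;
        try {
            methodUnderTest(true);
            // This line plays the role of Assert.fail(..): reached only
            // when the expected exception was NOT thrown.
            throw new AssertionError("expected IllegalArgumentException was not thrown");
        } catch (IllegalArgumentException e) {
            // Verification happens only here, hence the need for fail() above.
            verified = e.getMessage().equals("bad input");
        }
        System.out.println(verified);  // prints true
    }
}
```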
👍 after #2111 (comment) is resolved.

I'm still 👍 @binlijin feel free to merge this after you address @himanshug's comment

rebase and fix test
optimize create inverted indexes

During the "Create Inverted Indexes" phase of index persist or merge, we iterate over every value of a dimension, then look up that value's dictionary id in each index to get the bitmap.
Instead, we can iterate directly over the values' dictionary ids and get the corresponding dictionary id in each index from dimConversion to get the bitmap.
This improves performance significantly when the dimension's cardinality is high.
Currently I do not see any improvement when the data is small.
But we found that during large hadoop batch ingests with some high-cardinality dimensions, creating the inverted indexes in IndexMerger takes the most time.
I will run the performance tests later with large data.
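The idea in the description can be sketched as follows. This is a hedged, simplified illustration, not Druid's actual IndexMerger code: "before" resolves each merged value to an input index's id via a per-value string lookup; "after" reuses a precomputed int[] conversion table (playing the role of dimConversion, which the merge already builds while merging dictionaries), so the bitmap phase does plain array indexing instead of hashing millions of strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the optimization: replace a per-value string lookup with a
// precomputed merged-id -> index-id table. -1 marks "value absent from
// this input index".
public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] mergedDict = {"a", "b", "c"}; // merged dictionary, sorted
        String[] indexDict = {"a", "c"};       // one input index's dictionary

        Map<String, Integer> indexValueToId = new HashMap<>();
        for (int i = 0; i < indexDict.length; i++) {
            indexValueToId.put(indexDict[i], i);
        }

        // Before: for every merged value, hash the string to find its id
        // in the input index. With cardinality in the millions, this
        // hashing dominates the bitmap-building phase.
        int[] idsBefore = new int[mergedDict.length];
        for (int mergedId = 0; mergedId < mergedDict.length; mergedId++) {
            Integer id = indexValueToId.get(mergedDict[mergedId]);
            idsBefore[mergedId] = (id == null) ? -1 : id;
        }

        // After: the conversion table is computed once (in the real merge
        // it already exists from the dictionary-merge step), and the
        // bitmap phase only does int[] indexing per merged id.
        int[] dimConversion = new int[mergedDict.length];
        for (int mergedId = 0; mergedId < mergedDict.length; mergedId++) {
            Integer id = indexValueToId.get(mergedDict[mergedId]);
            dimConversion[mergedId] = (id == null) ? -1 : id;
        }
        int[] idsAfter = new int[mergedDict.length];
        for (int mergedId = 0; mergedId < mergedDict.length; mergedId++) {
            idsAfter[mergedId] = dimConversion[mergedId]; // no string hashing
        }

        System.out.println(Arrays.equals(idsBefore, idsAfter)); // prints true
    }
}
```

Both paths produce the same ids; the win is purely in how each id is obtained per input index.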