optimize create inverted indexes #2111

Merged
binlijin merged 1 commit into apache:master from binlijin:optimize-create-inverted-indexes
Jan 22, 2016

Conversation

@binlijin
Contributor

During index persist or merge, the "Create Inverted Indexes" phase iterates over every value of a dimension and looks up that value's dictionary id in each index to get the bitmap.
Instead, we can iterate directly over the dictionary ids and use dimConversion to get the corresponding dictionary id in each index, then fetch the bitmap by id.
This improves performance considerably when the dimension's cardinality is high.

Currently I do not see any improvement when the data is small.
But we found that when Hadoop batch ingestion runs on large data with some high-cardinality dimensions, creating inverted indexes in the index merger takes most of the time.
I will run performance tests later with large data.
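
To illustrate the idea in the description, here is a minimal sketch and not the actual IndexMerger code. `SegmentIndex`, `buildConversion`, and `mergeById` are hypothetical names, `java.util.BitSet` stands in for the real bitmap type, and merged row numbers are assumed to be a simple per-segment offset. The point is that each merged dictionary id is mapped to a per-segment id through a precomputed conversion array, so no value lookup is needed inside the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;

final class InvertedIndexSketch {

    /** Hypothetical per-segment index: sorted dictionary plus one bitmap of row numbers per local id. */
    static final class SegmentIndex {
        final List<String> dictionary; // sorted; position is the local dictionary id
        final List<BitSet> bitmaps;    // bitmaps.get(localId) = rows holding that value
        final int rowOffset;           // assumption: rows of this segment start here in the merged output

        SegmentIndex(List<String> dictionary, List<BitSet> bitmaps, int rowOffset) {
            this.dictionary = dictionary;
            this.bitmaps = bitmaps;
            this.rowOffset = rowOffset;
        }
    }

    static BitSet bits(int... rows) {
        BitSet b = new BitSet();
        for (int r : rows) b.set(r);
        return b;
    }

    /** Sorted union of all segment dictionaries; position is the merged dictionary id. */
    static List<String> mergedDictionary(List<SegmentIndex> segments) {
        TreeSet<String> all = new TreeSet<>();
        for (SegmentIndex s : segments) all.addAll(s.dictionary);
        return new ArrayList<>(all);
    }

    /** conversion[mergedId] = localId, or -1 when the segment does not contain the value. */
    static int[] buildConversion(List<String> merged, List<String> local) {
        int[] conversion = new int[merged.size()];
        int localId = 0;
        // Both lists are sorted and merged is a superset of local, so one pass is enough.
        for (int mergedId = 0; mergedId < merged.size(); mergedId++) {
            boolean match = localId < local.size() && local.get(localId).equals(merged.get(mergedId));
            conversion[mergedId] = match ? localId++ : -1;
        }
        return conversion;
    }

    /** Iterates merged dictionary ids and fetches each segment's bitmap by converted id. */
    static List<BitSet> mergeById(List<String> merged, List<SegmentIndex> segments) {
        List<int[]> conversions = new ArrayList<>();
        for (SegmentIndex s : segments) conversions.add(buildConversion(merged, s.dictionary));

        List<BitSet> result = new ArrayList<>();
        for (int mergedId = 0; mergedId < merged.size(); mergedId++) {
            BitSet rows = new BitSet();
            for (int i = 0; i < segments.size(); i++) {
                int localId = conversions.get(i)[mergedId];
                if (localId < 0) continue; // value absent from this segment
                SegmentIndex s = segments.get(i);
                BitSet local = s.bitmaps.get(localId);
                for (int r = local.nextSetBit(0); r >= 0; r = local.nextSetBit(r + 1)) {
                    rows.set(r + s.rowOffset);
                }
            }
            result.add(rows);
        }
        return result;
    }

    static List<BitSet> demo() {
        SegmentIndex first = new SegmentIndex(Arrays.asList("a", "c"), Arrays.asList(bits(0), bits(1)), 0);
        SegmentIndex second = new SegmentIndex(Arrays.asList("b", "c"), Arrays.asList(bits(0), bits(1)), 2);
        List<SegmentIndex> segments = Arrays.asList(first, second);
        return mergeById(mergedDictionary(segments), segments);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the conversion arrays cost one pass over each dictionary, after which the merge loop only does array reads by id, which is where a high-cardinality dimension would be expected to benefit.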

@binlijin
Contributor Author

Performance number 1:
Before:
2015-12-18 08:55:39,529 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed walk through of 11,192,533 rows in 295,312 millis.

2015-12-18 08:58:31,493 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 08:59:57,578 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 86,085 millis.

2015-12-18 09:02:06,165 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base472859607006656847flush/merged/v8-tmp] completed inverted.drd in 386,635 millis.

After:
2015-12-18 08:40:15,936 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed walk through of 11,192,533 rows in 292,092 millis.

2015-12-18 08:43:03,655 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[10,493,398]
2015-12-18 08:43:22,763 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 19,108 millis.

2015-12-18 08:45:03,878 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base4666050658270672045flush/merged/v8-tmp] completed inverted.drd in 287,941 millis.

@binlijin
Contributor Author

Performance number 2:
Before:
2015-12-18 09:44:16,345 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed walk through of 4,477,564 rows in 112,079 millis.

2015-12-18 09:45:12,948 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:45:32,210 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 19,262 millis.

2015-12-18 09:46:15,038 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base6193429426037721634flush/merged/v8-tmp] completed inverted.drd in 118,692 millis.

After:
2015-12-18 09:27:56,696 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed walk through of 4,477,564 rows in 119,256 millis.

2015-12-18 09:28:52,253 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Starting dimension[nid] with cardinality[4,362,606]
2015-12-18 09:28:58,954 INFO [main] segment.IndexMerger (Logger.java:info(70)) - Completed dimension[nid] in 6,701 millis.

2015-12-18 09:29:33,492 INFO [main] segment.IndexMerger (Logger.java:info(70)) - outDir[/tmp/base5295145984422027811flush/merged/v8-tmp] completed inverted.drd in 96,796 millis.

@fjy
Contributor

fjy commented Dec 18, 2015

@binlijin just looking at your merging times, have you thought about sharding your data more?

@fjy
Contributor

fjy commented Dec 18, 2015

In any case, this is cool

Contributor

minor formatting, need a space here

Contributor

Actually there's a bunch of formatting stuff in this PR. Please make sure to use the style guide.

Contributor Author

I use Eclipse and found that eclipse_formatting.xml is not useful, so I will try IntelliJ.

@binlijin
Contributor Author

@fjy, we have a big datasource and need to build 30 billion records every day, so maybe we need bigger segments.

@fjy
Contributor

fjy commented Dec 21, 2015

@binlijin You can create multiple segments for the same time interval with different shard numbers. I think you should try to keep segments around 5M rows. This is what we did for 100+ billion records per day.

@binlijin
Contributor Author

@fjy, we need to keep 15 days of data for this big datasource and will query it one day at a time, so if we end up with too many segments, can Druid handle it?

@binlijin binlijin closed this Dec 22, 2015
@binlijin binlijin reopened this Dec 22, 2015
@binlijin binlijin closed this Dec 28, 2015
@binlijin binlijin reopened this Dec 28, 2015
@binlijin binlijin closed this Dec 28, 2015
@binlijin binlijin reopened this Dec 28, 2015
@binlijin
Contributor Author

@fjy, what is the problem, and why does Travis fail?

@fjy
Contributor

fjy commented Dec 29, 2015

@binlijin there's a couple of non-deterministic unit tests

If you pull the latest master and merge in #2165, things should pass

@binlijin
Contributor Author

@fjy, thanks..

@binlijin binlijin closed this Dec 29, 2015
@binlijin binlijin reopened this Dec 29, 2015
@fjy
Contributor

fjy commented Dec 29, 2015

👍 this looks good to me now, but I think someone else who knows this code should do a review as well

@binlijin binlijin closed this Dec 30, 2015
@binlijin binlijin reopened this Dec 30, 2015
@binlijin
Contributor Author

binlijin commented Jan 7, 2016

Related to #2138

@binlijin
Contributor Author

binlijin commented Jan 7, 2016

@xvrl can you take a look?

Member

can we update the comments to explain how BitmapIndexHolder works?

@binlijin
Contributor Author

ping @xvrl

Member

Is there a reason to call nullToEmpty here? This seems like it might be an artifact of wrapping DimDim with NullValueConverterDimDim; however, getBitmapIndex relies on the actual values stored in DimDim, not the values returned by the wrapper. Is that correct?

Contributor Author

Yes, you are right. NullValueConverterDimDim converts empty to null, so we need to convert it back to the actual value, because getBitmapIndex relies on the actual values stored in DimDim.

Member

I think this warrants a comment, given that it took me a while to track down the reason for this.

Contributor Author

Yes, done. I already added a comment for this.
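
The null handling discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the code under review: `storedIds`, `wrapperView`, and `lookup` are hypothetical names, and `nullToEmpty` is an inlined equivalent of Guava's `Strings.nullToEmpty` so the example has no dependencies. It shows why a value read through a wrapper that turns "" into null has to be converted back before a lookup keyed on the stored values.

```java
import java.util.HashMap;
import java.util.Map;

final class NullToEmptySketch {

    /** Hypothetical store keyed by the actual stored values; a missing value is kept as "". */
    static final Map<String, Integer> storedIds = new HashMap<>();

    static {
        storedIds.put("", 0);
        storedIds.put("x", 1);
    }

    /** Hypothetical wrapper view: presents the stored empty string to callers as null. */
    static String wrapperView(String stored) {
        return stored.isEmpty() ? null : stored;
    }

    /** Inlined equivalent of Guava's Strings.nullToEmpty. */
    static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }

    /** The lookup is keyed on stored values, so wrapper values are converted back first. */
    static Integer lookup(String valueFromWrapper) {
        return storedIds.get(nullToEmpty(valueFromWrapper));
    }

    public static void main(String[] args) {
        String seenByCaller = wrapperView("");
        System.out.println("direct lookup:    " + storedIds.get(seenByCaller));
        System.out.println("converted lookup: " + lookup(seenByCaller));
    }
}
```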

@binlijin binlijin closed this Jan 18, 2016
@binlijin
Contributor Author

@xvrl rebase

@binlijin binlijin reopened this Jan 18, 2016
@binlijin binlijin closed this Jan 20, 2016
@binlijin
Contributor Author

rebase

@binlijin binlijin reopened this Jan 20, 2016
@fjy fjy added this to the 0.9.0 milestone Jan 20, 2016
@fjy
Contributor

fjy commented Jan 21, 2016

@himanshug can you take a look to help finish this off?

Contributor

I believe EmptyIndexedInts should be a singleton. It already has a static final instance; can you use that? Also, make the no-arg constructor in that class private.
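
For readers unfamiliar with the pattern being requested, a minimal sketch follows. `EmptyInts` and `SimpleIndexedInts` are hypothetical stand-ins, not the real EmptyIndexedInts class or its interface; the sketch only shows a stateless empty collection exposed through one shared instance, with a private constructor so no other instances can be created.

```java
import java.util.Collections;
import java.util.Iterator;

final class EmptyInts implements SimpleIndexedInts {

    /** The single shared instance; safe to share because the class holds no state. */
    static final EmptyInts INSTANCE = new EmptyInts();

    /** Private so callers must go through INSTANCE. */
    private EmptyInts() {
    }

    @Override
    public int size() {
        return 0;
    }

    @Override
    public int get(int index) {
        throw new IndexOutOfBoundsException("index " + index + " requested from an empty list");
    }

    @Override
    public Iterator<Integer> iterator() {
        return Collections.emptyIterator();
    }

    public static void main(String[] args) {
        System.out.println("size = " + EmptyInts.INSTANCE.size());
    }
}

/** Hypothetical, simplified interface for a read-only list of ints. */
interface SimpleIndexedInts extends Iterable<Integer> {
    int size();

    int get(int index);
}
```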

@himanshug
Contributor

@binlijin can you update the PR description with a summary of why this change improves performance? It will be helpful to anyone looking at this PR.

Contributor

Can you make both static variables private as well?

Contributor

OK, I see that they are used in other places.

@binlijin
Contributor Author

@himanshug done, I updated the PR description.

@binlijin binlijin closed this Jan 21, 2016
@binlijin
Contributor Author

rebase

@binlijin binlijin reopened this Jan 21, 2016
Contributor

You should add an Assert.fail(..) here, or else the verification doesn't happen when the exception is not thrown.

Contributor Author

Good catch, I will fix it.
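
The issue raised in this thread can be shown with a small dependency-free sketch. It is not the test from this PR: `expectIllegalState` is a hypothetical helper, and the local `fail` method stands in for JUnit's `Assert.fail(String)`, which likewise throws an `AssertionError`. Without the `fail` call after the action, a run in which nothing is thrown would skip the catch block and pass silently.

```java
final class ExpectedExceptionSketch {

    /** Stand-in for JUnit's Assert.fail(String), which throws AssertionError. */
    static void fail(String message) {
        throw new AssertionError(message);
    }

    /** Passes only if the action throws IllegalStateException with the expected message. */
    static void expectIllegalState(Runnable action, String expectedMessage) {
        try {
            action.run();
            // Reached only when nothing was thrown; without this line the check below never runs.
            fail("expected IllegalStateException, but nothing was thrown");
        } catch (IllegalStateException e) {
            if (!expectedMessage.equals(e.getMessage())) {
                fail("unexpected message: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        // Throws as expected, so the helper returns normally.
        expectIllegalState(() -> { throw new IllegalStateException("boom"); }, "boom");

        // Throws nothing, so the helper reports a failure instead of passing silently.
        try {
            expectIllegalState(() -> { }, "boom");
        } catch (AssertionError e) {
            System.out.println("caught failure: " + e.getMessage());
        }
    }
}
```

Because `AssertionError` is an `Error` rather than an `IllegalStateException`, the failure raised inside the `try` block is not swallowed by the `catch` clause.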

@himanshug
Contributor

👍 after #2111 (comment) is resolved.

@fjy
Contributor

fjy commented Jan 21, 2016

I'm still 👍

@binlijin feel free to merge this after you address @himanshug's comment

@binlijin binlijin closed this Jan 22, 2016
@binlijin
Contributor Author

rebase and fix test

@binlijin binlijin reopened this Jan 22, 2016
binlijin added a commit that referenced this pull request Jan 22, 2016
@binlijin binlijin merged commit 1d1f4d9 into apache:master Jan 22, 2016
@fjy fjy mentioned this pull request Feb 5, 2016
@binlijin binlijin deleted the optimize-create-inverted-indexes branch February 18, 2016 07:24

4 participants