Sort HadoopIndexer rows by time+dim bucket to help reduce spilling #1097
fjy merged 1 commit into apache:master
Conversation
xvrl commented on Feb 6, 2015
- This should help reduce spilling on the IndexGeneratorJob reducers and consequently also reduce memory usage when merging indices. (A rough sketch of the keying idea follows this list.)
- Fixes #1095 (Sort by rollup key in hadoop indexer)
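A minimal sketch of the keying idea, assuming a composite sort key of truncated timestamp followed by a hash of the dimension values. The class and method names here are illustrative, not the actual Druid internals:

```java
import java.nio.ByteBuffer;

// Illustrative only: build a byte[] sort key that orders rows by time
// bucket first, then by a hash of their dimension values, so rows that
// would roll up together arrive adjacent at the reducer.
public class RowSortKey
{
  public static byte[] toSortKey(long truncatedTimestamp, byte[] hashedDimensions)
  {
    // Big-endian long first, so byte-wise comparison sorts by time,
    // then the dimension hash to cluster identical rollup keys.
    return ByteBuffer.allocate(Long.BYTES + hashedDimensions.length)
                     .putLong(truncatedTimestamp)
                     .put(hashedDimensions)
                     .array();
  }
}
```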
I don't think getConfig() will be able to return anything here. It is only filled in after Hadoop calls setup().
Making the thing a method should work (getConfig().getGranularitySpec().getQueryGranularity() probably isn't that expensive compared to all the other crazy stuff we're doing…). Or, if you want to cache it, lazy initialization works too; a sketch follows below.
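A minimal sketch of the lazy-initialization option, assuming a Hadoop reducer whose configuration is only populated once setup() runs. The property name and the String stand-in for the real granularity type are placeholders, not the actual IndexGeneratorJob code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LazyConfigReducer extends Reducer<BytesWritable, Text, Text, Text>
{
  private Configuration conf;  // only populated once Hadoop calls setup()
  private String granularity;  // lazily derived; stands in for QueryGranularity

  @Override
  protected void setup(Context context)
  {
    conf = context.getConfiguration();
  }

  private String getGranularity()
  {
    // Lazy: the first call happens from reduce(), i.e. after setup() has
    // run, so the configuration is guaranteed to be populated by then.
    // "druid.indexer.granularity" is a made-up property for illustration.
    if (granularity == null) {
      granularity = conf.get("druid.indexer.granularity", "NONE");
    }
    return granularity;
  }
}
```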
@xvrl: have you been able to run an indexing task that was showing the error case, or otherwise add tests?
@drcrallen There was nothing wrong with the previous approach, it was just less efficient. I'm planning to run a job or two to see how much it helps, but it'll be data dependent.
(force-pushed 3fdcc2f to f37a0c9)
@xvrl I'm more concerned that this causes some other behavior to change, and I'm not familiar with our unit test coverage on the hadoop indexing stuff.
The sortKey in the SortableBytes is never used as part of the reducer, and …
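For context, a hedged sketch of the group-key/sort-key split being referred to here. The byte layout below is illustrative, not the actual SortableBytes wire format: a grouping comparator reads only the group-key portion, so the sort key affects the order rows arrive in within a reduce() call, but not which call they are grouped into:

```java
import java.nio.ByteBuffer;

// Illustrative encoding: length-prefix the group key so a grouping
// comparator can compare just that portion, while a sort comparator
// compares the group key first and then the sort key.
public class GroupThenSortKey
{
  public static byte[] encode(byte[] groupKey, byte[] sortKey)
  {
    return ByteBuffer.allocate(Integer.BYTES + groupKey.length + sortKey.length)
                     .putInt(groupKey.length)
                     .put(groupKey)
                     .put(sortKey)
                     .array();
  }
}
```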
I ran some index jobs with and without the patch, using artificially low rowFlushBoundary values to force more frequent spilling. My dataset was not ideal, so the difference is not huge (141 spills with the patch vs. 148 without), but it still makes a difference that could be much more noticeable for larger data sets. The output indexes are identical when it comes to the inverted index and the time and dimension columns. There are slight differences in the metrics due to floating-point rounding errors, on the order of 3e-06 at most on one of the columns, which are inevitable given the different merge order.
(force-pushed 09e6de8 to d5ee791)
(force-pushed d5ee791 to b1ec7af)
Probably doesn't matter, but I don't think the truncatedTimestamp is technically needed in the hashed dimensions set.
Agree. I left it in as a precaution, in case something in the reducer inadvertently depended on input rows being sorted by time first.
I think this change should be safe, though I do agree with @drcrallen that we are probably lacking in test coverage. It's probably too much to ask to get a unit test in here, but if we could take this chance to think about how this could be unit tested and put those thoughts into an Issue, that might be useful?

Also, if memory serves, the HadoopIndexerJob does the rollup portion of its processing entirely in the reducer, meaning that if the data isn't rolled up ahead of time, we push a lot more data across the wire than needed. It's use-case dependent, but I would think that getting a combiner involved, so that we can do a partial rollup, might actually result in better overall performance of the Hadoop jobs (a rough sketch follows below). This isn't a reason to block this PR, just a thought that came to mind while reading the code.
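A rough sketch of that combiner idea, assuming a single long metric summed per rollup key. The class name and the sum aggregation are illustrative assumptions, not the actual IndexGeneratorJob code:

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// A Hadoop combiner is just a Reducer run on the map side; registering
// this via job.setCombinerClass(...) would partially roll up rows that
// share a rollup key before they cross the wire.
public class PartialRollupCombiner
    extends Reducer<BytesWritable, LongWritable, BytesWritable, LongWritable>
{
  @Override
  protected void reduce(BytesWritable rollupKey, Iterable<LongWritable> metrics, Context context)
      throws IOException, InterruptedException
  {
    // Collapse all map-side rows with the same (truncated time, dimensions)
    // key into one partial aggregate; the reducer then merges far fewer rows.
    long sum = 0;
    for (LongWritable metric : metrics) {
      sum += metric.get();
    }
    context.write(rollupKey, new LongWritable(sum));
  }
}
```

One caveat: real Druid aggregators are not all simple sums, so a proper combiner would need mergeable intermediate aggregates rather than a plain long.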
Agree, doing it as part of the map/combine phase could certainly help reduce time spent pushing data. That's a bigger change, though, and something I wouldn't want to include at the last minute. The purpose of this fix is purely to make memory pressure a function of segment size rather than data size.