
Sort HadoopIndexer rows by time+dim bucket to help reduce spilling #1097

Merged

fjy merged 1 commit into apache:master from metamx:better-hadoop-sort-key on Feb 25, 2015

Conversation

@xvrl (Member) commented Feb 6, 2015

Contributor (inline review comment)

I don't think getConfig() will be able to return anything here; it is only filled in after Hadoop calls setup().

Contributor (inline review comment)

Making it a method should work; getConfig().getGranularitySpec().getQueryGranularity() probably isn't that expensive compared to all the other crazy stuff we're doing. Or, if you want to cache it, lazy-initialize it.
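For illustration, a lazily initialized lookup along these lines could look like the sketch below. This is only a sketch of the suggestion: the mapper class name, generic types, and the loadConfig() helper are hypothetical, not the actual indexer code; only getGranularitySpec().getQueryGranularity() comes from the discussion above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExampleIndexMapper extends Mapper<Object, Text, BytesWritable, Text>
{
  private HadoopDruidIndexerConfig config;   // only populated once Hadoop calls setup()
  private QueryGranularity rollupGran;       // lazily cached derived value

  @Override
  protected void setup(Context context)
  {
    // Hadoop invokes setup() before map(), so this is the earliest point
    // at which the config is safe to read.
    config = loadConfig(context.getConfiguration());
  }

  private QueryGranularity getRollupGranularity()
  {
    // Lazy initialization: resolve once on first use, then reuse the cached value.
    if (rollupGran == null) {
      rollupGran = config.getGranularitySpec().getQueryGranularity();
    }
    return rollupGran;
  }

  private static HadoopDruidIndexerConfig loadConfig(Configuration conf)
  {
    // Hypothetical placeholder for however the job materializes its config
    // from the Hadoop Configuration.
    throw new UnsupportedOperationException("illustrative only");
  }
}
```

Either way, the key point is that the config field is only safe to read after Hadoop has invoked setup().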

@drcrallen (Contributor)

@xvrl: have you been able to run an indexing task that was showing the error case, or otherwise add tests?

@xvrl (Member, Author) commented Feb 6, 2015

@drcrallen There was nothing wrong with the previous approach, it was just less efficient. I'm planning to run a job or two to see how much it helps, but it'll be data-dependent.

@xvrl force-pushed the better-hadoop-sort-key branch from 3fdcc2f to f37a0c9 on February 6, 2015 22:22
@drcrallen (Contributor)

@xvrl I'm more concerned that this might cause some other behavior to change, and I'm not familiar with our unit-test coverage of the Hadoop indexing stuff.

@xvrl (Member, Author) commented Feb 6, 2015

The sortKey in the SortableBytes is never used as part of the reducer, and the IndexGeneratorMapper class is only used as part of this job, so it should only affect sorting in Hadoop. The timestamp sorting is preserved as before, but truncated down to the rollup granularity, which should not be an issue.
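For illustration, the sort-key idea described above could be sketched roughly as follows. Assumptions: a fixed-duration rollup granularity and a simple hash standing in for the dimension bucket; the actual patch builds the key through SortableBytes and Druid's granularity classes, which handle more cases than this toy truncation.

```java
import java.nio.ByteBuffer;
import java.util.List;

public class SortKeyExample
{
  static byte[] buildSortKey(long timestampMillis, long rollupGranularityMillis, List<String> dims)
  {
    // Truncate the event time down to its rollup bucket, so rows that will be
    // rolled up together also sort (and spill) together in the shuffle.
    long truncated = timestampMillis - (timestampMillis % rollupGranularityMillis);

    // Truncated time first, then a hash of the dimension values.
    return ByteBuffer.allocate(Long.BYTES + Integer.BYTES)
                     .putLong(truncated)
                     .putInt(dims.hashCode())  // stand-in for hashing the dimension bucket
                     .array();
  }
}
```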

@xvrl (Member, Author) commented Feb 6, 2015

I ran some index jobs with and without the patch, using artificially low rowFlushBoundary values to make spilling more likely. My dataset was not ideal, so the difference is not huge (141 spills with the patch vs. 148 spills without), but it still makes a difference, which could be much more noticeable for larger datasets.

The output indexes are identical when it comes to the inverted index, as well as the time and dimension columns. There are some slight differences in the metrics due to floating-point rounding errors, at most on the order of 3e-06 on one of the columns, which are inevitable given the differences in merge order.

@xvrl force-pushed the better-hadoop-sort-key branch 3 times, most recently from 09e6de8 to d5ee791 on February 10, 2015 18:00
@xvrl force-pushed the better-hadoop-sort-key branch from d5ee791 to b1ec7af on February 10, 2015 22:26
Contributor (inline review comment)

Probably doesn't matter, but I don't think the truncatedTimestamp is technically needed in the hashed dimensions set.

Member Author (inline review comment)

Agree. I just left it in as a precaution as part of the sort key, in case something in the reducer inadvertently depended on input rows being sorted by time first.

@cheddar (Contributor) commented Feb 12, 2015

I think this change should be safe, though I do agree with @drcrallen that we are probably lacking in test coverage. It's probably too much to ask to get a unit test in here, but if we could take this chance to think about how this could be unit tested and put those thoughts into an issue, that might be useful.

Also, if memory serves, the HadoopIndexerJob does the rollup portion of its processing entirely in the reducer, meaning that if the data isn't rolled up ahead of time, we are pushing a lot more data across the wire than needed. It's use-case dependent, but I would think that getting a combiner involved so we can do a partial rollup might actually result in better overall performance of the Hadoop jobs. This isn't a reason to block this PR, just a thought that came to mind when I read the code.
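For illustration, a map-side partial-rollup combiner along the lines @cheddar describes might look like the sketch below. The class name, key/value types, and the mergeRows() aggregation are placeholders, not the actual indexer's types; the real merge would have to aggregate metrics according to the configured aggregators.

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PartialRollupCombiner
    extends Reducer<BytesWritable, BytesWritable, BytesWritable, BytesWritable>
{
  @Override
  protected void reduce(BytesWritable key, Iterable<BytesWritable> rows, Context context)
      throws IOException, InterruptedException
  {
    // All mapper-local rows sharing the same (truncated time, dimension bucket) key
    // can be partially rolled up here, so less data crosses the wire to the reducer.
    BytesWritable merged = null;
    for (BytesWritable row : rows) {
      merged = (merged == null) ? copyOf(row) : mergeRows(merged, row);
    }
    if (merged != null) {
      context.write(key, merged);
    }
  }

  private static BytesWritable copyOf(BytesWritable row)
  {
    return new BytesWritable(row.copyBytes());
  }

  private static BytesWritable mergeRows(BytesWritable a, BytesWritable b)
  {
    // Placeholder: a real implementation would deserialize both rows and
    // aggregate their metric values.
    throw new UnsupportedOperationException("illustrative only");
  }
}
```

Such a combiner would be registered on the job with job.setCombinerClass(PartialRollupCombiner.class), and it only pays off when many mapper-local rows share the same rollup key.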

@xvrl (Member, Author) commented Feb 13, 2015

Agreed, doing it as part of the map/combine phase could certainly help reduce time spent pushing data. That's a bigger change, though, and not something I'd want to include at the last minute. The purpose of this fix is purely to make memory pressure a function of segment size rather than data size.

@xvrl added this to the 0.7.0 milestone on Feb 13, 2015
@xvrl self-assigned this on Feb 13, 2015
@xvrl modified the milestone: 0.7.0 on Feb 20, 2015
fjy added a commit that referenced this pull request Feb 25, 2015
Sort HadoopIndexer rows by time+dim bucket to help reduce spilling
@fjy merged commit 6424815 into apache:master on Feb 25, 2015
@fjy deleted the better-hadoop-sort-key branch on February 25, 2015 20:49

Linked issue

Sort by rollup key in hadoop indexer (may be closed by merging this pull request)