Sort HadoopIndexer rows by time+dim bucket to help reduce spilling #1097
fjy merged 1 commit into apache:master
Conversation
xvrl commented on Feb 6, 2015
- This should help reduce spilling on the IndexGeneratorJob reducers and consequently also reduce memory usage when merging indices. (A rough sketch of the keying idea follows this list.)
- Fixes #1095 (Sort by rollup key in hadoop indexer)
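A minimal sketch of the keying idea, assuming a composite sort key of truncated timestamp followed by a hash of the dimension values. The class and method names here are illustrative, not the actual Druid internals:

```java
import java.nio.ByteBuffer;

// Illustrative only: build a byte[] sort key that orders rows by time
// bucket first, then by a hash of their dimension values, so rows that
// would roll up together arrive adjacent at the reducer.
public class RowSortKey
{
  public static byte[] toSortKey(long truncatedTimestamp, byte[] hashedDimensions)
  {
    // Big-endian long first, so byte-wise comparison sorts by time,
    // then the dimension hash to cluster identical rollup keys.
    return ByteBuffer.allocate(Long.BYTES + hashedDimensions.length)
                     .putLong(truncatedTimestamp)
                     .put(hashedDimensions)
                     .array();
  }
}
```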
I don't think getConfig() will be able to return anything here. It is only filled in after Hadoop calls setup().
Making the thing a method should work (getConfig().getGranularitySpec().getQueryGranularity() probably isn't that expensive compared to all the other crazy stuff we're doing…). Or, if you want to cache it, lazy initialization works too; a sketch follows below.
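A minimal sketch of the lazy-initialization option, assuming a Hadoop reducer whose configuration is only populated once setup() runs. The property name and the String stand-in for the real granularity type are placeholders, not the actual IndexGeneratorJob code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LazyConfigReducer extends Reducer<BytesWritable, Text, Text, Text>
{
  private Configuration conf;  // only populated once Hadoop calls setup()
  private String granularity;  // lazily derived; stands in for QueryGranularity

  @Override
  protected void setup(Context context)
  {
    conf = context.getConfiguration();
  }

  private String getGranularity()
  {
    // Lazy: the first call happens from reduce(), i.e. after setup() has
    // run, so the configuration is guaranteed to be populated by then.
    // "druid.indexer.granularity" is a made-up property for illustration.
    if (granularity == null) {
      granularity = conf.get("druid.indexer.granularity", "NONE");
    }
    return granularity;
  }
}
```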
@xvrl: have you been able to run an indexing task that was showing the error case, or otherwise add tests?
@drcrallen There was nothing wrong with the previous approach, it was just less efficient. I'm planning to run a job or two to see how much it helps, but it'll be data dependent.
(force-pushed 3fdcc2f to f37a0c9)
@xvrl I'm more concerned that this causes some other behavior to change, and I'm not familiar with our unit test coverage on the hadoop indexing stuff.
The sortKey in the SortableBytes is never used as part of the reducer, and …
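For context, a hedged sketch of the group-key/sort-key split being referred to here. The byte layout below is illustrative, not the actual SortableBytes wire format: a grouping comparator reads only the group-key portion, so the sort key affects the order rows arrive in within a reduce() call, but not which call they are grouped into:

```java
import java.nio.ByteBuffer;

// Illustrative encoding: length-prefix the group key so a grouping
// comparator can compare just that portion, while a sort comparator
// compares the group key first and then the sort key.
public class GroupThenSortKey
{
  public static byte[] encode(byte[] groupKey, byte[] sortKey)
  {
    return ByteBuffer.allocate(Integer.BYTES + groupKey.length + sortKey.length)
                     .putInt(groupKey.length)
                     .put(groupKey)
                     .put(sortKey)
                     .array();
  }
}
```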
I ran some index jobs with and without the patch, using artificially low rowFlushBoundary values to force more frequent spilling. My dataset was not ideal, so the difference is not huge (141 spills with the patch vs. 148 without), but it still makes a difference that could be much more noticeable for larger data sets. The output indexes are identical when it comes to the inverted index and the time and dimension columns. There are slight differences in the metrics due to floating-point rounding errors, on the order of 3e-06 at most on one of the columns, which are inevitable given the different merge order.
(force-pushed 09e6de8 to d5ee791)
(force-pushed d5ee791 to b1ec7af)
Probably doesn't matter, but I don't think the truncatedTimestamp is technically needed in the hashed dimensions set.
Agree. I left it in as a precaution, in case something in the reducer inadvertently depended on input rows being sorted by time first.
I think this change should be safe, though I do agree with @drcrallen that we are probably lacking in test coverage. It's probably too much to ask to get a unit test in here, but if we could take this chance to think about how this could be unit tested and put those thoughts into an Issue, that might be useful?

Also, if memory serves, the HadoopIndexerJob does the rollup portion of its processing entirely in the reducer, meaning that if the data isn't rolled up ahead of time, we push a lot more data across the wire than needed. It's use-case dependent, but I would think that getting a combiner involved, so that we can do a partial rollup, might actually result in better overall performance of the Hadoop jobs (a rough sketch follows below). This isn't a reason to block this PR, just a thought that came to mind while reading the code.
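A rough sketch of that combiner idea, assuming a single long metric summed per rollup key. The class name and the sum aggregation are illustrative assumptions, not the actual IndexGeneratorJob code:

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// A Hadoop combiner is just a Reducer run on the map side; registering
// this via job.setCombinerClass(...) would partially roll up rows that
// share a rollup key before they cross the wire.
public class PartialRollupCombiner
    extends Reducer<BytesWritable, LongWritable, BytesWritable, LongWritable>
{
  @Override
  protected void reduce(BytesWritable rollupKey, Iterable<LongWritable> metrics, Context context)
      throws IOException, InterruptedException
  {
    // Collapse all map-side rows with the same (truncated time, dimensions)
    // key into one partial aggregate; the reducer then merges far fewer rows.
    long sum = 0;
    for (LongWritable metric : metrics) {
      sum += metric.get();
    }
    context.write(rollupKey, new LongWritable(sum));
  }
}
```

One caveat: real Druid aggregators are not all simple sums, so a proper combiner would need mergeable intermediate aggregates rather than a plain long.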
Agree, doing it as part of the map/combine phase could certainly help reduce time spent pushing data. That's a bigger change, though, and something I wouldn't want to include at the last minute. The purpose of this fix is purely to make memory pressure a function of segment size rather than data size.