
[WIP] Mapside aggregating hadoop indexer #2670

Closed

navis wants to merge 5 commits into apache:master from navis:combining-hadoop-indexer

Conversation

@navis (Contributor) commented Mar 16, 2016

The current hadoop indexer pushes rows to the context one by one, but if the timestamps of the rows don't vary much (as in an hourly batch), we can aggregate rows in mapper memory first. I know hadoop has a combiner, but it's infamously inefficient; even Hive didn't use it.
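The pattern described here is the classic "in-mapper combining" idiom. Below is a minimal sketch of that general pattern, not the PR's actual code; the tab-separated input format, the group key layout, and the MAX_ROWS_IN_MEMORY limit are all illustrative assumptions.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Best-effort in-mapper aggregation: rows sharing a (truncated timestamp, dimensions)
// key are merged in memory and spilled to the context only when the buffer fills up.
public class AggregatingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final int MAX_ROWS_IN_MEMORY = 75_000; // hypothetical conservative default
  private final Map<String, Long> buffer = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    String groupKey = parts[0];                 // e.g. hour-truncated timestamp + dims
    long metric = Long.parseLong(parts[1]);
    buffer.merge(groupKey, metric, Long::sum);  // aggregate instead of emitting directly
    if (buffer.size() >= MAX_ROWS_IN_MEMORY) {
      flush(context);                           // best effort: spill partial aggregates
    }
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Long> e : buffer.entrySet()) {
      context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
    buffer.clear();
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context); // emit whatever is still buffered at end of input
  }
}
```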

@navis (Contributor, Author) commented Mar 16, 2016

Waiting on #2650.

@binlijin (Contributor) commented:

@navis, how much did the performance improve?

@gianm (Contributor) commented Mar 16, 2016

I have not read the impl yet, but this is a nice concept IMO. My one request, at a high level, is that it should either be off by default or have a very conservative default for maxRowsInMemory, to prevent people from hitting OOMEs on upgrade.

@himanshug (Contributor) commented:
The hadoop combiner allows you to truly merge *all* the rows that are mergeable at the mapper, because it is pretty much like the reducer, and hadoop carefully supplies it all the rows that were mergeable together. The change in this PR can only do best-effort merging, or else it must hold a lot of IncrementalIndexes in memory to merge all the rows that could be merged (remember, mappers don't get their input data in any particular order).
Did you try "useCombiner", which does the same thing, and notice any major problems? Is the hadoop combiner really becoming the bottleneck?
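For contrast, a hedged sketch of how a standard Hadoop combiner is wired (the mapper/combiner/reducer class names are hypothetical). Hadoop sorts and groups map output by key before the combiner runs, which is what gives it the "all mergeable rows" guarantee described above, at the cost of serializing and sorting every row first:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  // The combiner is invoked on sorted, grouped map output, so it sees every
  // co-located value for a key -- a true merge rather than a best-effort one.
  public static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "druid-hadoop-indexer");
    job.setMapperClass(IndexMapper.class);       // hypothetical mapper
    job.setCombinerClass(IndexCombiner.class);   // merge logic shared with the reducer
    job.setReducerClass(IndexReducer.class);
    return job;
  }
}
```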

@gianm (Contributor) commented Mar 16, 2016

My experience in general is that combiners are good if reducers are a big bottleneck and mappers are not. But if things are more balanced between mappers and reducers, then doing in-heap aggregation on the mappers incurs less mapper overhead than using a combiner (combiners involve serde and sorting costs). So it could be better or worse depending on the workload.

It's kind of a point of contention in the hadoop world though (some projects prefer combiners and some prefer in-heap aggregation…). Some even support both at the same time :)

It would certainly be good to see real numbers for this approach vs. useCombiner on your workload, @navis. Even better if you could also share numbers for this approach plus useCombiner together vs. either one alone.

@himanshug (Contributor) commented:
@gianm let's say the very first row and the very last row in the dataset are mergeable together; then the only way you will merge them without a combiner is if you hold the IncrementalIndex for that group key until the very last moment. Now multiply that by the total number of distinct group keys the mapper gets, if you truly want to merge everything that could be merged.

@gianm (Contributor) commented Mar 16, 2016

@himanshug it's not necessary to merge everything; the idea with in-heap aggregation is that it's just best effort, but that can be good enough for a substantial reduction in data sent to the reducers, without incurring the overhead of guaranteed merging at both the mapper and reducer levels.

It would be good to confirm for some real-world workloads whether or not this approach works better than a combiner.

Review comment (Contributor):

It needs to check whether inputRow is an instance of SegmentInputRow; it seems that combiningAggs should be used instead of aggregators in that case.
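A minimal sketch of the suggested check (the method shape is an assumption; SegmentInputRow, combiningAggs, and aggregators are the names from the review comment):

```java
// Rows re-read from existing segments (SegmentInputRow) carry already-aggregated
// metric values, so they must be folded with the combining aggregator factories
// rather than the original ones.
AggregatorFactory[] chooseAggregators(
    InputRow inputRow,
    AggregatorFactory[] aggregators,
    AggregatorFactory[] combiningAggs
)
{
  return inputRow instanceof SegmentInputRow ? combiningAggs : aggregators;
}
```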

@sirpkt (Contributor) commented Mar 16, 2016

IMO this patch looks really good, especially when the input data is roughly sorted by timestamp, and my machine logs are like that.
Map-side disk spill is one of the major performance bottlenecks in MR, and a combiner does nothing about it; this patch can reduce the number of spills by minimizing map output.
Even though it works as best effort, I think it would fit well with time-series data.

@navis (Contributor, Author) commented Mar 17, 2016

It needs more elaboration, but:
- 740K unique rows: 54 sec for the current indexer, 83 sec with this patch.
- 100K unique rows × 10 duplication: 37 sec for the current indexer (36 sec with combine), 24 sec with this patch.

navis force-pushed the combining-hadoop-indexer branch from fe9b575 to 67d1100 on March 17, 2016.
Review comment (Contributor):

How about using bucket intervals to make each IncrementalIndex, to (maybe) reduce the memory footprint? Like the below:
config.getGranularitySpec().bucketInterval(timestamp).get();
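A sketch of the reviewer's idea (the per-bucket `indexes` map, the `makeIndex` factory, and the method shape are assumptions): keying the in-memory indexes by segment bucket bounds what each IncrementalIndex has to hold.

```java
// One IncrementalIndex per segment bucket, created lazily; rows falling outside
// every bucket interval are skipped, matching the granularity spec.
void addToBucketedIndex(InputRow inputRow, DateTime timestamp) throws Exception {
  Optional<Interval> bucket = config.getGranularitySpec().bucketInterval(timestamp);
  if (bucket.isPresent()) {
    IncrementalIndex index = indexes.computeIfAbsent(bucket.get(), k -> makeIndex(k));
    index.add(inputRow); // aggregates in place when the row's key already exists
  }
}
```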

@navis (Contributor, Author) replied:

Addressed that.

@navis (Contributor, Author) commented Mar 22, 2016

I think @gianm already explained the intention of this patch well (thanks!). The current serde for inputRow produces a very big, complex binary, and it did not seem possible to use a binary combiner (I forgot the exact name for this in hadoop). So to combine rows, hadoop must read the object back from the binary and write it again, which is a lot of work for both CPU and memory.
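To make that serde cost concrete, here is a hedged sketch of what a combiner over serialized rows has to do; `InputRow` stands in for the row type, and `deserializeRow`, `mergeRows`, and `serializeRow` are hypothetical stand-ins for the actual row serde, not real Druid APIs:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SerdeCombiner
    extends Reducer<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {
  @Override
  protected void reduce(BytesWritable key, Iterable<BytesWritable> values, Context context)
      throws IOException, InterruptedException {
    List<InputRow> rows = new ArrayList<>();
    for (BytesWritable value : values) {
      rows.add(deserializeRow(value.copyBytes()));  // CPU: decode each big binary
    }
    InputRow merged = mergeRows(rows);              // memory: hold rows to combine them
    context.write(key, new BytesWritable(serializeRow(merged))); // CPU: encode again
  }
}
```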

navis force-pushed the combining-hadoop-indexer branch 2 times, most recently from 1eae336 to 1ee16a8 on May 20, 2016.
navis force-pushed the combining-hadoop-indexer branch from 1ee16a8 to 6c4d922 on June 7, 2016.
@stale (bot) commented Feb 28, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

The stale bot added the stale label on Feb 28, 2019.
@stale (bot) commented Mar 7, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

The stale bot closed this on Mar 7, 2019.