add config to optionally disable all compression in intermediate segment persists during ingestion #7919
Conversation
|
Hmm, I really like the idea of being able to separately control this stuff for the intermediary segments, so mega +1 on that. However, I'm not sure how I feel about disabling it outright. Experimentation I did related to #6016 showed an often very dramatic size difference between compressed and non-compressed data, particularly with int and long typed columns, that I don't think can be ignored. Even the size difference between using

Some other ideas have been suggested to help mitigate merging issues like this, I think. #5526 suggests adjustments to the merging algorithm that would reduce overall memory usage, such as producing the dictionary out of band and reducing the number of duplicated operations. #7900 additionally suggests something else we could do that would specifically help the unbounded-decompression-buffer-on-merge issue, in the form of a sort of hierarchical merge. Since we can measure how many intermediary segments we have, with column counts for each, we could calculate how much buffer is required and divide merge work up as necessary to keep total usage at a reasonable size. If I understand correctly, it also suggests some similar reworking of the merge algorithm, as mentioned in #5526.

The other thing that makes me think this might not be the best default behavior, at least, is that to simplify things for new users, the documentation for getting started and smaller cluster tuning suggests running co-tenant middle-manager and historical processes. I suspect that if uncompressed column size differences are noticeable, this will greatly increase contention for OS free space between these processes, especially at merge time, when all columns of all intermediary segments will be paged in during the merge (#6699 again; some related discussion on this in the comments). This is already a thing that bothers me about this setup, and I think it could make this issue a lot worse.
It's also totally possible I'm being overly cautious, but I think we need more data before going with these defaults. |
|
I want to add that it's also entirely possible I'm overestimating how effective compression is for these intermediary segments, especially at the very small intermediate segment sizes occurring in the issue referenced by this change, which are likely to give lz4 fewer opportunities to take advantage of repetition. My main point, I guess, is just that we should investigate the matter to make sure things are going to be chill if we do make this change. |
|
@clintropolis all the PRs you linked are independent improvements to indexing, index merging, compression, freeing buffers as soon as possible, etc., which are great and will happen at some point. Now, the real question that affects this PR is whether to change the default IndexSpec for intermediate persisted segments or not. From a code perspective it is fairly trivial to retain the existing default behavior, and I can make that change if that is what most people desire; however, here is my rationale for changing the default. My observation on different clusters has been that the current indexing task process's peak memory usage is much higher than its average utilization during ingestion, due to the merge process at publish time. Because of that, users plan for the task process to have the "peak" memory available to it throughout its lifetime. |
|
Ah, I agree with you on all points about merge being a peak of like everything 😅
This is all I really want to investigate a bit deeper before we go with these defaults, because my gut thinks it would only lower peak usage in cases where there are very large numbers of small segments with lots of columns, since it really takes a lot before the 64KB buffers start to become a significant consumer of direct memory. If that is not the typical merge, then peak memory usage, in the form of pressure on OS free space, could be a fair bit higher with uncompressed intermediary segments, because merging will need to hit all pages, essentially cycling the entire uncompressed segment through free space. If the middle manager has the free space to spare, or sufficiently fast disks, there is certainly no issue, but this could be pretty unchill if running co-tenant with a historical, or if free space is tight enough it could perhaps push the disk into becoming the bottleneck due to more faults? I hopefully should have some time to fully review this PR soon; I will try to measure some things to see if I can make myself feel better about this being the default 👍 |
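To make the back-of-envelope math above concrete, here is a minimal sketch (the buffer size comes from the 64KB figure in the discussion; the segment and column counts are purely illustrative assumptions, not measurements) of how direct memory held by per-column decompression buffers scales with the number of intermediary segments being merged:

```python
BUFFER_BYTES = 64 * 1024  # 64 KB decompression buffer per compressed column


def merge_buffer_estimate(num_intermediary_segments: int,
                          columns_per_segment: int) -> int:
    """Rough upper bound on direct memory held by decompression buffers
    when all intermediary segments are merged at once."""
    return num_intermediary_segments * columns_per_segment * BUFFER_BYTES


# Hypothetical example: 100 small persists of a 30-column datasource.
estimate = merge_buffer_estimate(100, 30)
print(estimate / (1024 ** 2), "MiB")  # -> 187.5 MiB
```

This is why the buffers only dominate when there are "very large numbers of small segments with lots of columns": at modest segment and column counts the total stays well under typical direct memory allotments.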
|
@clintropolis you are right that there are cases where peak usage may not be much different from average usage. I appreciate you trying it out on the realistic datasets you might have from your customers. |
Now that we are thinking of retaining the existing default, would it make sense to add documentation? |
|
Yeah, I need to update this PR to retain the older behavior as the default... will get to it this week. |
|
@gianm @clintropolis this now retains existing behavior as the default, and the newly added property is documented. I still kept the |
nishantmonu51
left a comment
There was a problem hiding this comment.
Generally looks good to me. It would be great if you could also add some serde tests, e.g. KafkaSupervisorTuningConfigTest, to verify serde with the new property.
|
@nishantmonu51 updated the serde tests |
|
"packaging build" appears to be failing for unknown/unrelated reason... probably timing out or something. |
I think this is related to #7998, I believe @jihoonson is working on a fix. |
|
Oops. Yes, it’s because of #7998 and I’m still debugging. Will try to fix by today. |
|
@clintropolis @jihoonson thanks for the build fix. |
Motivation:
All Druid ingestion tasks consume an excessive amount of off-heap memory during the final merge of intermediate persisted segments. A good fraction of that memory goes towards holding many 64K decompression buffers; this was recently reported in #7824, and I have seen this behavior on many different clusters in the past. Most of these 64K decompression buffers can be avoided if compression is disabled on intermediate persisted segments; compression is not really required on those intermediate segments, only on the final segment that gets pushed.
Changes Made:
This patch adds a new tuning config, "indexSpecForIntermediatePersists", that can be used inside the task tuningConfig to disable dimension/metric compression on intermediate persisted segments. The existing default behavior (compressed intermediate persists) is retained. Most of the changes in the patch are to add this config.
Testing:
Existing unit and integration tests should cover all changes here.
Release Notes:
This PR must be mentioned in the release notes to inform users, especially about the new "indexSpecForIntermediatePersists" config. It is now possible to include indexSpecForIntermediatePersists in various task tuning configs to disable dimension/metric compression of intermediate persisted segments, reducing the memory required to merge them into the final segment to be published. This might increase the page cache usage of intermediate segments while they are used for querying, but overall it should be useful in cases where the peak memory utilization of the indexing task process during the final merge is much higher than the average memory utilization during ingestion.
To disable dimension/metric compression on intermediate persisted segments, the following can be added to the task tuningConfig.
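As an illustrative sketch (the task type "kafka" is just an example; the field names follow Druid's documented indexSpec options, where "uncompressed" and "none" are the values that disable dimension and metric compression respectively), the addition might look like:

```json
"tuningConfig": {
  "type": "kafka",
  "indexSpecForIntermediatePersists": {
    "dimensionCompression": "uncompressed",
    "metricCompression": "none"
  }
}
```

The regular indexSpec, which controls the final published segment, is left untouched, so the pushed segment is still compressed as before.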