add config to optionally disable all compression in intermediate segment persists during ingestion #7919
Conversation
|
Hmm, I really like the idea of being able to separately control this stuff for the intermediary segments, so mega +1 on that. However, I'm not sure how I feel about disabling it outright. Experimentation I did related to #6016 showed an often very dramatic size difference between compressed and non-compressed data, particularly with int and long typed columns, that I don't think can be ignored. Even the size difference between using

Some other ideas have been suggested to help mitigate merging issues like this, I think. #5526 suggests adjustments to the merging algorithm that would reduce overall memory usage, such as producing the dictionary out of band and reducing the number of duplicated operations. #7900 additionally suggests something else we could do that would specifically help the unbounded-decompression-buffer-on-merge issue, in the form of a sort of hierarchical merge. Since we can measure how many intermediary segments we have, with column counts for each, we could calculate how much buffer is required and divide merge work up as necessary to keep total usage at a reasonable size. If I understand correctly, it also suggests some similar reworking of the merge algorithm, as mentioned in #5526.

The other thing that makes me think this might not be the best default behavior, at least, is that to simplify things for new users, the documentation for getting started and smaller cluster tuning suggests running co-tenant middle-manager and historical processes. I suspect that if uncompressed column size differences are noticeable, this will greatly increase contention for OS free space between these processes, especially at merge time, when all columns of all intermediary segments will be paged in during the merge (#6699 again; some related discussion on this in the comments). This is already a thing that bothers me about this setup, and I think it could make this issue a lot worse.
It's also totally possible I'm being overly cautious, but I think we need more data before going with these defaults. |
|
I want to add that it's also entirely possible I'm overestimating how effective compression is for these intermediary segments, especially at the very small intermediate segment sizes occurring in the issue referenced by this change, which are likely to give lz4 fewer opportunities to take advantage of repetition. My main point, I guess, is just that we should investigate the matter to make sure things are going to be chill if we do make this change. |
|
@clintropolis all the PRs you linked are independent improvements to indexing, index merging, compression, freeing buffers as soon as possible, etc., which are great and will happen at some point. Now, the real question that affects this PR is whether to change the default IndexSpec for intermediate persisted segments or not. From a code perspective it is fairly trivial to retain the existing default behavior, and I can make that change if that is what most people desire; however, here is my rationale for changing the default. My observation on different clusters has been that the current indexing task process's peak memory usage is much higher than its average utilization during ingestion, due to the merge process at publish time. Because of that, users plan for the task process to have the "peak" memory available to it throughout its lifetime. |
|
Ah, I agree with you on all points about merge being a peak of like everything 😅
This is all I really want to investigate a bit deeper before we go with these defaults, because my gut thinks it would only lower peak usage in cases where there are very large numbers of small segments with lots of columns, since it really takes a lot before the 64KB buffers start to become a significant consumer of direct memory. If that is not the typical merge, then peak memory usage, in the form of pressure on OS free space, could be a fair bit higher with uncompressed intermediary segments, because merging will need to hit all pages, essentially cycling the entire uncompressed segment through free space. If the middle manager has the free space to spare, or sufficiently fast disks, there is certainly no issue, but this could be pretty unchill if running co-tenant with a historical, or if free space is tight enough it could perhaps push the disk into becoming the bottleneck due to more faults? I hopefully should have some time to fully review this PR soon; I will try to measure some things to see if I can make myself feel better about this being the default 👍 |
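To make the back-of-envelope math above concrete, here is a minimal sketch (the buffer size comes from the 64KB figure in the discussion; the segment and column counts are purely illustrative assumptions, not measurements) of how direct memory held by per-column decompression buffers scales with the number of intermediary segments being merged:

```python
BUFFER_BYTES = 64 * 1024  # 64 KB decompression buffer per compressed column


def merge_buffer_estimate(num_intermediary_segments: int,
                          columns_per_segment: int) -> int:
    """Rough upper bound on direct memory held by decompression buffers
    when all intermediary segments are merged at once."""
    return num_intermediary_segments * columns_per_segment * BUFFER_BYTES


# Hypothetical example: 100 small persists of a 30-column datasource.
estimate = merge_buffer_estimate(100, 30)
print(estimate / (1024 ** 2), "MiB")  # -> 187.5 MiB
```

This is why the buffers only dominate when there are "very large numbers of small segments with lots of columns": at modest segment and column counts the total stays well under typical direct memory allotments.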
|
@clintropolis you are right that there are cases where peak usage may not be much different from average usage. I appreciate you trying it out on the realistic datasets you might have from your customers. |
Now that we are thinking of retaining the existing default, would it make sense to add documentation? |
|
Yeah, I need to update this PR to retain the older behavior as the default... will get to it this week. |
|
@gianm @clintropolis this now retains existing behavior as the default, and the newly added property is documented. I still kept the |
nishantmonu51
left a comment
There was a problem hiding this comment.
Generally looks good to me. It would be great if you could also add some serde tests, e.g. KafkaSupervisorTuningConfigTest, to verify serde with the new property.
|
@nishantmonu51 updated the serde tests |
|
"packaging build" appears to be failing for unknown/unrelated reason... probably timing out or something. |
I think this is related to #7998, I believe @jihoonson is working on a fix. |
|
Oops. Yes, it’s because of #7998 and I’m still debugging. Will try to fix by today. |
|
@clintropolis @jihoonson thanks for the build fix. |
Motivation:
All Druid ingestion tasks consume an excessive amount of off-heap memory during the final merge of intermediate persisted segments. A good fraction of that memory goes towards holding many 64K decompression buffers; this was recently reported in #7824, and I have seen this behavior on many different clusters in the past. Most of these 64K decompression buffers can be avoided if compression is disabled on intermediate persisted segments; compression is not really required on those intermediate segments, only on the final segment that gets pushed.
Changes Made:
This patch adds a new tuning config, "indexSpecForIntermediatePersists", that can be used inside the task tuningConfig to disable dimension/metric compression on intermediate persisted segments. The existing default behavior (compressed intermediate persists) is retained. Most of the changes in the patch are to add this config.
Testing:
Existing unit and integration tests should cover all changes here.
Release Notes:
This PR must be mentioned in the release notes to inform users, especially about the new "indexSpecForIntermediatePersists" config. It is now possible to include indexSpecForIntermediatePersists in various task tuning configs to disable dimension/metric compression of intermediate persisted segments, reducing the memory required to merge them into the final segment to be published. This might increase the page cache usage of intermediate segments while they are used for querying, but overall it should be useful in cases where the peak memory utilization of the indexing task process during the final merge is much higher than the average memory utilization during ingestion.
To disable dimension/metric compression on intermediate persisted segments, the following can be added to the task tuningConfig.
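As an illustrative sketch (the task type "kafka" is just an example; the field names follow Druid's documented indexSpec options, where "uncompressed" and "none" are the values that disable dimension and metric compression respectively), the addition might look like:

```json
"tuningConfig": {
  "type": "kafka",
  "indexSpecForIntermediatePersists": {
    "dimensionCompression": "uncompressed",
    "metricCompression": "none"
  }
}
```

The regular indexSpec, which controls the final published segment, is left untouched, so the pushed segment is still compressed as before.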