
Remediate ingestion failures when number of segments in time period is larger than 32767#15090

Closed
dulu98Kurz wants to merge 4 commits into apache:master from dulu98Kurz:dun_bugfix_statemap

Conversation

@dulu98Kurz

@dulu98Kurz dulu98Kurz commented Oct 4, 2023

Fixes #15091.

Description

This PR attempts to remediate the exception java.lang.IllegalArgumentException: fromKey > toKey, which occurs when the number of segments is larger than Java's Short.MAX_VALUE (32767). Without full context on why we limit the number of segments in a time period to the range of a Java short, this is what I believe could keep ingestion going.
I can send a different PR if it is appropriate to change the number of segments to be in the range of int instead of short, which requires a larger scope of changes.
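For illustration only (this is a standalone sketch, not Druid code), the narrowing cast at the root of this failure can be reproduced in isolation:

```java
public class ShortOverflowDemo {
    public static void main(String[] args) {
        // Java's narrowing conversion keeps only the low 16 bits of the int,
        // so the first value past Short.MAX_VALUE wraps to a negative short.
        int partitionId = 32768;           // Short.MAX_VALUE + 1
        short cast = (short) partitionId;
        System.out.println(cast);          // -32768

        // A negative fromKey paired with a positive toKey is what triggers
        // java.lang.IllegalArgumentException: fromKey > toKey in a sorted map.
    }
}
```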

Fixed the bug

#15091.

Renamed the class

None

Added a forbidden-apis entry ...

None

Release note

Prevent ingestion failures when the number of segments in a time period exceeds 32767


Key changed/added classes in this PR
  • OvershadowableManager

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@dulu98Kurz dulu98Kurz changed the title from "Dun bugfix statemap" to "Remediate ingestion failures when number of segments in time period is larger than 32767" on Oct 4, 2023
@dulu98Kurz
Author

Let me know if it is appropriate to refactor partitionId to int instead of short, which I believe would solve this problem more completely.

Contributor

@kfaraz kfaraz left a comment


Thanks for the changes, @dulu98Kurz ! I have left some comments.

{
final RootPartitionRange lowFench = new RootPartitionRange(partitionId, partitionId);
// remediate submap `fromKey > toKey` issue when partitionId overflows
final short partitionIdLowFence = partitionId < 0 ? Short.MAX_VALUE : partitionId;
Contributor


I think just checking the argument in the constructor of RootPartitionRange is enough. Since RootPartitionRange does not accept a startPartitionId or endPartitionId less than 0, we will not have a case where the partitionId passed to this method is less than 0.
And even if it is, we should throw an exception rather than silently converting it to the max value.
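A minimal sketch of the validation suggested here; the class and method names follow the quoted snippets in this thread, but the body is illustrative rather than the actual Druid source:

```java
// Hypothetical sketch: fail fast in the constructor instead of silently
// wrapping on the int -> short narrowing cast. Not the real Druid class.
class RootPartitionRange {
    private final short startPartitionId;
    private final short endPartitionId;

    private RootPartitionRange(short startPartitionId, short endPartitionId) {
        if (startPartitionId < 0 || endPartitionId < 0) {
            throw new IllegalArgumentException(
                "Partition ids must be non-negative, got ["
                + startPartitionId + ", " + endPartitionId + "]");
        }
        this.startPartitionId = startPartitionId;
        this.endPartitionId = endPartitionId;
    }

    static RootPartitionRange of(int startPartitionId, int endPartitionId) {
        // If the int argument exceeded Short.MAX_VALUE, the cast below would
        // produce a negative short, which the constructor now rejects.
        return new RootPartitionRange((short) startPartitionId, (short) endPartitionId);
    }
}
```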

Author

@dulu98Kurz dulu98Kurz Oct 6, 2023


Thanks for checking on this, @kfaraz! The constructor of RootPartitionRange that takes a short startPartitionId is private, so we are forced to use:

static RootPartitionRange of(int startPartitionId, int endPartitionId)
    {
      return new RootPartitionRange((short) startPartitionId, (short) endPartitionId);
    }

Since startPartitionId is in int range, when startPartitionId > Short.MAX_VALUE (32767) the cast from int to short starts producing negative numbers, and the cycle repeats as startPartitionId keeps growing, because casting from int to short loses precision.
We ran into this short-overflow scenario described in #15091 and our ingestion task for new data was completely broken because of it; throwing an exception would still make the ingestion fail.
Here I am setting startPartitionId to Short.MAX_VALUE so that it won't produce java.lang.IllegalArgumentException: fromKey > toKey and break ingestion when we do stateMap.subMap(lowFence, false, highFence, false); this is just a remediation.

I believe a better way to handle this is to allow startPartitionId and endPartitionId to be int and avoid the problematic precision-loss cast. I can send another PR with that solution if you can confirm why we had this short limit originally and that it is appropriate to do so.

Best,
Dun

Author

@dulu98Kurz dulu98Kurz Oct 9, 2023


Hi @kfaraz thanks again for spending time on this!
Bumping again just in case you missed my last message. I'm more than glad to help if anything needs further clarification, and I can also connect on a Zoom call if it's convenient for you!

Best Regards,
Dun

Contributor


@dulu98Kurz - I am not entirely sure, but short was likely chosen to save on the memory that storing these ids takes. 32K partitions in one single interval is too high. Can you describe a bit more how your cluster ends up in this situation, and why that is a genuine scenario? In my experience, almost every time an interval touches this high a number, it means that compaction is not configured or ingestion is misconfigured.

Author


Hi @abhishekagarwal87, thanks for checking on this!
You are right: our investigation suggests both late messages from upstream and compaction falling behind. Specifically, we found random late messages mixed into the Kafka topics; they kept adding tiny segments to finalized time chunks and eventually went beyond the short range, breaking live ingestion tasks for new data. Setting a rejection period was not ideal because it means we would lose data, and because compaction was falling behind we could not afford to wait for it to catch up. I ended up hard-deleting the problematic time chunk, and then I realized that relying solely on compaction seems inadequate.

Admittedly, handling random late messages is not an ideal use case for Druid, but it was a really difficult choice when the user has to pick between letting ingestion break and deleting the problematic time chunk.

Author


So instead of capping at Short.MAX_VALUE, we could possibly cover the gap by:

  • Allowing partitionId to go into the int range
  • Logging error messages that strongly remind users to compact or reduce the number of segments.

For users who do not have late messages or compaction issues, this change has no impact, because they won't store more than Short.MAX_VALUE segments anyway, so we don't break the original intention of saving memory.

For users who actually produce segments beyond Short.MAX_VALUE, this buys them more time to compact or reduce the number of segments, which may eventually avoid the difficult situation above.

Author

@dulu98Kurz dulu98Kurz Oct 9, 2023


From a code quality perspective, Short.toUnsignedInt is a precision-loss conversion and we use it 18 times across 2 files; we could simplify the logic and improve readability by changing to int.
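As a standalone illustration of that round-trip (not Druid code), a short that has wrapped negative can be re-read as an unsigned value, but the raw short itself no longer compares correctly:

```java
public class ToUnsignedIntDemo {
    public static void main(String[] args) {
        short wrapped = (short) 40000;                     // narrows to -25536
        System.out.println(wrapped);                       // -25536
        System.out.println(Short.toUnsignedInt(wrapped));  // 40000

        // Short.toUnsignedInt recovers values up to 65535, but ordering
        // comparisons on the raw short are still wrong (wrapped < 0),
        // which is why plain int ids would simplify the code.
    }
}
```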

Lastly, when partitionId is out of range, the logic we use to handle it right now is simply wrong:

final RootPartitionRange lowFench = new RootPartitionRange(partitionId, partitionId);
final RootPartitionRange highFence = new RootPartitionRange(Short.MAX_VALUE, Short.MAX_VALUE);
return stateMap.subMap(lowFench, false, highFence, false).entrySet().iterator();

For example, when partitionId has wrapped around to a negative value, stateMap.subMap(lowFench, false, highFence, false) will return all entries instead of an empty result ...
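A simplified stand-in (plain short keys instead of RootPartitionRange, so illustrative only) showing how a wrapped-negative low fence makes the open range cover the whole map:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class SubMapDemo {
    public static void main(String[] args) {
        // Simplified stand-in for the stateMap keyed by short partition ids.
        TreeMap<Short, String> stateMap = new TreeMap<>();
        stateMap.put((short) 0, "a");
        stateMap.put((short) 100, "b");
        stateMap.put((short) 20000, "c");

        // After the narrowing cast, a partitionId just past Short.MAX_VALUE
        // becomes -32768, which sorts below every real key.
        short lowFence = (short) 32768;  // -32768
        SortedMap<Short, String> sub =
            stateMap.subMap(lowFence, false, Short.MAX_VALUE, false);
        System.out.println(sub.size());  // 3 -- every entry, instead of none
    }
}
```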

If we are OK with the remediation in this PR, we can proceed with merging; if we are OK with refactoring, please allow me to send another PR to fix it more completely.

Author


Adding the PR that refactors partitionId from short to int, so that we can compare the scope of changes:
#15116

@dulu98Kurz
Author

Hi @kfaraz @abhishekagarwal87, sorry for being verbose on the thread. I hope I have described our issue well; please let me know your thoughts. I'm open to any solution that can keep our ingestion alive without deleting historical data.

@cryptoe
Contributor

cryptoe commented Oct 19, 2023

@dulu98Kurz Thank you for the patience. Since most of the committers are busy with Druid 28 work, there might be some delay.

I think we have an ugly fail-safe here. What I would prefer is to change the exception message to something nicer, so that the user does not need to read through Druid's code base.

As @abhishekagarwal87 said, 32K partitions per interval just seems too massive and is generally a sign that either compaction is lagging behind or late data is coming in.
If we change this variable to int, the ticking time bomb has a larger impact and can lead to a full cluster outage where the number of segments in the cluster balloons into the millions.

I think if we do want to change this to int, we should add a guard rail on the number of partitions per interval, raising the implicit 32K to something larger, maybe 50K?
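A hypothetical sketch of such a guard rail; the class name, cap value, and message wording are illustrative only and do not correspond to actual Druid constants:

```java
// Illustrative guard rail: keep partition ids as int, but reject intervals
// that exceed a configurable cap, with an error message that tells the user
// what to do instead of surfacing a raw sorted-map exception.
final class PartitionIdGuard {
    // Example cap per the discussion above; not a real Druid constant.
    static final int MAX_PARTITIONS_PER_INTERVAL = 50_000;

    static int checkPartitionId(int partitionId) {
        if (partitionId < 0 || partitionId >= MAX_PARTITIONS_PER_INTERVAL) {
            throw new IllegalArgumentException(
                "Too many segments in one time chunk (partitionId=" + partitionId
                + ", max=" + MAX_PARTITIONS_PER_INTERVAL
                + "). Consider enabling compaction or using a coarser segment granularity.");
        }
        return partitionId;
    }
}
```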

@dulu98Kurz
Author

@cryptoe thanks for checking on this!
Agreed, letting the number of segments grow wildly toward Integer.MAX_VALUE would definitely accumulate a bigger problem.

I'll update the PR to include a cap of 50K, or 65536 if we want to go by binary convention, and also make the exception message more meaningful. I will send the update soon; thanks again for the attention, @cryptoe!

@dulu98Kurz
Author

Hi @cryptoe @abhishekagarwal87 @kfaraz, please find the updates in PR #15116; it includes the changes per our discussion:

  1. A more informative exception message
  2. partitionId in RootPartitionRange refactored from short to int, but with a max value of 65536 to keep memory pressure under control.
  3. Unnecessary, problematic short-to-int conversions removed.

Please let me know of any concerns; I'll close this PR once #15116 is merged.

@github-actions

github-actions Bot commented Mar 7, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 7, 2024
@github-actions

github-actions Bot commented Apr 5, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Apr 5, 2024

Successfully merging this pull request may close these issues.

Ingestion failure when number of segments in time trunk exceeded Short.MAX_VALUE