Refactor and add tests and metric to KillUnusedSegments duty (auto-kill)#15941
Refactor and add tests and metric to KillUnusedSegments duty (auto-kill)#15941abhishekrb19 merged 13 commits intoapache:masterfrom
Conversation
Initial commit with: - Bug fixes - auto-kill can throw NPE when there are no datasources present and defaults mismatch. - Add new stat for candidate segment intervals killed. - Move a couple of debug logs to info logs for improved visibility (should only log once per kill period). - Remove redundant checks for code readability. - Updated tests from using mocks (also the mocks weren't using last updated timestamp) and add more test coverage for different config parameters. - Add a couple of unit tests that are ignored for the eternity case to prove that the kill duty doesn't clean up segments with ALL grain or that end in DateTimes.MAX. - Migrate Druid exception from user to operator persona.
kfaraz
left a comment
There was a problem hiding this comment.
Overall, the changes make the code more readable. I couldn't find the main NPE bugfix amidst all the refactor. If possible, maybe create two separate PRs - one to add the metric and the bugfix and another for all the refactors?
I have left some comments even though the PR is still in draft, so that changes can be incorporated early.
Also, rename to disambiguate.
|
Thanks for the review, @kfaraz! I've addressed your comments and kept the new metric along with other minor functional changes in this patch. But let me know if you'd prefer to keep the changes separate. Re the NPE fix, my bad, it was from existing test code -- |
kfaraz
left a comment
There was a problem hiding this comment.
Thanks for the prompt response, @abhishekrb19 !
Left a few minor comments, rest of the changes look good.
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
- Clarify docs on eligibility. - Add test for multiple segments in the same interval. Clarify comment. - Remove log line from test. - Remove lastUpdatedDate = now.plus(10) from test.
| |`killTask/maxSlot/count`| Maximum number of task slots available for auto kill tasks in the auto kill run. | |Varies| | ||
| |`kill/task/count`| Number of tasks issued in the auto kill run. | |Varies| | ||
| |`kill/candidateUnusedSegments/count`|Number of candidate unused segments to be deleted from the metadata store in an auto kill run.|`dataSource`|Varies| | ||
| |`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies| |
There was a problem hiding this comment.
| |`kill/candidateUnusedSegments/count`|The number of candidate unused segments eligible for deletion from the metadata store during an auto kill run for a datasource.|`dataSource`|Varies| | |
| |`kill/eligibleUnusedSegments/count`|The number of unused segments of a datasource that are identified as eligible for deletion from the metadata store by the coordinator.|`dataSource`|Varies| |
kfaraz
left a comment
There was a problem hiding this comment.
Only 1 minor suggestion regarding docs and metric name. But that change can be done later and need not block this PR. It would trigger the whole CI pipeline all over again.
+1
|
Thanks, @kfaraz! I will add your rename suggestion in a follow-up patch. |
Address review comment from apache#15941 (comment)
Address review comment from apache#15941 (comment)
… & rename metric (#15977) * All segments stored in the same batch have the same created_date entry. In the absence of a group_id column, this metadata would allow us to easily reason about and troubleshoot ingestion-related issues. * Rename metric name and code references to eligibleUnusedSegments. Address review comment from #15941 (comment)
This PR makes a few functional and non-functional changes in the auto-kill
KillUnusedSegmentsduty.Functional changes:
kill/eligibleUnusedSegments/countwith dimensiondataSourcethat tracks the number of candidate unused segments per datasource per kill cycle. It's candidate because it's the kill task that ultimately kills the unused segments and the number can change.Refactoring changes:
@VisibleForTestingannotations as the test code now verifies the duty directly instead of mocks + various hooks into the code.Testing changes:
TestDruidCoordinatorConfig.DEFAULT_COORDINATOR_KILL_BUFFER_PERIODwas incorrectly set to 1 day instead of 90 days. While at it, I changed all the test coordinator defaults to match the same ISO 8601 period format asDruidCoordinatorConfig, so it's easy to spot check.DateTimes.MAX(outside the normal Datetime string comparison range): Auto-kill doesn't delete segments outside the range [0000-01-01/10000-01-01) #15951Motivation
In general, I found it difficult to track the progress of the auto-kill coordinator duty. Also, see some discussion in the community slack: https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1707941058448649.
My preliminary conclusion from looking at a couple of clusters is that the auto-kill defaults are overly conservative and need to be revisited as #15912 notes. I would like to address some of the concerns raised in #15911 and #15912 separately once we have some additional data and tests from this patch.
Release note
Add a new metric
kill/eligibleUnusedSegments/countwith dimensiondataSourcethat tracks the number of eligible unused segments of a datasource that are eligible for deletion by the coordinator.Key changed/added classes in this PR
KillUnusedSegments.javaKillUnusedSegmentsTest.javaThis PR has: