Skip to content

Kill tasks honor the buffer period of unused segments#15710

Merged
abhishekrb19 merged 19 commits intoapache:masterfrom
abhishekrb19:kill_buffer_period_bug
Jan 19, 2024
Merged

Kill tasks honor the buffer period of unused segments#15710
abhishekrb19 merged 19 commits intoapache:masterfrom
abhishekrb19:kill_buffer_period_bug

Conversation

@abhishekrb19
Copy link
Copy Markdown
Contributor

@abhishekrb19 abhishekrb19 commented Jan 17, 2024

Problem Description

With #12599, operators can configure druid.coordinator.kill.bufferPeriod,
allowing the KillUnusedSegments duty to utilize the buffer period to find candidate unused segments.

  • The coordinator duty KillUnusedSegments determines an umbrella interval
    for each datasource to determine the kill interval. The umbrella interval can be a widened
    interval and there can be multiple unused segments in this interval with different
    used_status_last_updated timestamps. For example, consider two unused segments in the
    interval [2023-01-01/2023-01-05] that were marked as unused at different times, one that
    is 30 days old and another that is 1 hour old.

  • However, when a kill task is instantiated with this umbrella interval, it’d kill all the unused segments
    found in the interval regardless of the last updated timestamp. In the above example, the kill task
    after the 30-day mark would kill both the unused segments and not retain the 1-hour old one.

Fix

Pass in the maxUsedFlagLastUpdatedTime to the kill task and the unused segments retrieve task action.
Adjust the metadata SQL query to account for used_status_last_updated <= maxUsedStatusLastUpdatedTime when maxUsedFlagLastUpdatedTime is non-null.

Release note

  • Resolved an issue where the auto-kill feature failed to honor the specified buffer period. This occurred when multiple unused segments within an interval were marked as unused at different times.
  • Users can submit kill tasks with an optional parameter maxUsedStatusLastUpdatedTime. When set to a date time, the kill task considers segments in the specified interval marked as unused no later than this time. The default behavior is to kill all unused segments in the interval regardless of the time when segments where marked as unused.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

- The coordinator duty KillUnusedSegments determines an umbrella interval
 for each datasource to determine the kill interval. There can be multiple unused
segments in an umbrella interval with different used_status_last_updated timestamps.
For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently
the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour
old one.

- However, when a kill task is instantiated with this umbrella interval, it’d kill
all the unused segments regardless of the last updated timestamp. We need kill
tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing
unused segments in the kill interval prematurely.
@abhishekrb19 abhishekrb19 force-pushed the kill_buffer_period_bug branch 3 times, most recently from f98d291 to f3f6236 Compare January 17, 2024 20:08
Assert.assertEquals(task.getId(), taskQuery.getId());
Assert.assertEquals(task.getDataSource(), taskQuery.getDataSource());
Assert.assertEquals(task.getInterval(), taskQuery.getInterval());
Assert.assertNull(taskQuery.getMarkAsUnused());

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [ClientKillUnusedSegmentsTaskQuery.getMarkAsUnused](1) should be avoided because it has been deprecated.
Comment thread server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java Outdated
This is consistent with the column name `used_status_last_updated`.
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left initial feedback, haven't gone through a few files yet.

Comment thread server/src/main/java/org/apache/druid/rpc/indexing/OverlordClient.java Outdated
Comment thread docs/operations/clean-metadata-store.md Outdated
Comment thread docs/operations/clean-metadata-store.md Outdated
abhishekrb19 and others added 7 commits January 17, 2024 20:21
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…data storage coordinator interface.

Removes the following interface methods in favor of a new method added:
- retrieveUnusedSegmentsForInterval(String, Interval)
- retrieveUnusedSegmentsForInterval(String, Interval, Integer)
@abhishekrb19 abhishekrb19 force-pushed the kill_buffer_period_bug branch from 7996dd6 to 0995ea0 Compare January 18, 2024 08:31
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, @abhishekrb19 !

@abhishekrb19
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews, @zachjsh @kfaraz @jon-wei!

@abhishekrb19 abhishekrb19 merged commit 38c1def into apache:master Jan 19, 2024
@abhishekrb19 abhishekrb19 deleted the kill_buffer_period_bug branch January 19, 2024 06:23
@abhishekrb19 abhishekrb19 added this to the Druid 29.0.0 milestone Jan 19, 2024
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Jan 31, 2024
* Kill tasks should honor the buffer period of unused segments.

- The coordinator duty KillUnusedSegments determines an umbrella interval
 for each datasource to determine the kill interval. There can be multiple unused
segments in an umbrella interval with different used_status_last_updated timestamps.
For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently
the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour
old one.

- However, when a kill task is instantiated with this umbrella interval, it’d kill
all the unused segments regardless of the last updated timestamp. We need kill
tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing
unused segments in the kill interval prematurely.

* Clarify default behavior in docs.

* test comments

* fix canDutyRun()

* small updates.

* checkstyle

* forbidden api fix

* doc fix, unused import, codeql scan error, and cleanup logs.

* Address review comments

* Rename maxUsedFlagLastUpdatedTime to maxUsedStatusLastUpdatedTime

This is consistent with the column name `used_status_last_updated`.

* Apply suggestions from code review

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

* Make period Duration type

* Remove older variants of runKilLTask() in OverlordClient interface

* Test can now run without waiting for canDutyRun().

* Remove previous variants of retrieveUnusedSegments from internal metadata storage coordinator interface.

Removes the following interface methods in favor of a new method added:
- retrieveUnusedSegmentsForInterval(String, Interval)
- retrieveUnusedSegmentsForInterval(String, Interval, Integer)

* Chain stream operations

* cleanup

* Pass in the lastUpdatedTime to markUnused test function and remove sleep.

---------

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
abhishekagarwal87 pushed a commit that referenced this pull request Jan 31, 2024
* Kill tasks should honor the buffer period of unused segments.

- The coordinator duty KillUnusedSegments determines an umbrella interval
 for each datasource to determine the kill interval. There can be multiple unused
segments in an umbrella interval with different used_status_last_updated timestamps.
For example, consider an unused segment that is 30 days old and one that is 1 hour old. Currently
the kill task after the 30-day mark would kill both the unused segments and not retain the 1-hour
old one.

- However, when a kill task is instantiated with this umbrella interval, it’d kill
all the unused segments regardless of the last updated timestamp. We need kill
tasks and RetrieveUnusedSegmentsAction to honor the bufferPeriod to avoid killing
unused segments in the kill interval prematurely.

* Clarify default behavior in docs.

* test comments

* fix canDutyRun()

* small updates.

* checkstyle

* forbidden api fix

* doc fix, unused import, codeql scan error, and cleanup logs.

* Address review comments

* Rename maxUsedFlagLastUpdatedTime to maxUsedStatusLastUpdatedTime

This is consistent with the column name `used_status_last_updated`.

* Apply suggestions from code review



* Make period Duration type

* Remove older variants of runKilLTask() in OverlordClient interface

* Test can now run without waiting for canDutyRun().

* Remove previous variants of retrieveUnusedSegments from internal metadata storage coordinator interface.

Removes the following interface methods in favor of a new method added:
- retrieveUnusedSegmentsForInterval(String, Interval)
- retrieveUnusedSegmentsForInterval(String, Interval, Integer)

* Chain stream operations

* cleanup

* Pass in the lastUpdatedTime to markUnused test function and remove sleep.

---------

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants