
Add embedded kill tasks that run on the Overlord#18028

Merged
kfaraz merged 6 commits into apache:master from kfaraz:embedded_kill_tasks
Jun 3, 2025

Conversation

@kfaraz (Contributor) commented May 21, 2025

Description

Kill tasks currently suffer from several drawbacks.

  • They take up a task slot.
  • They take more time than necessary.
    • The actual kill operation takes only a few seconds. The majority of the time is spent in
      spawning the peon and inter-process communication.
  • They are often not able to keep up with the unused segments in the cluster.
  • They are difficult to get right due to several configs like maxKillTaskSlotRatio, maxKillTaskSlots,
    kill.bufferPeriod, kill.durationToRetain, kill.period.

This patch adds an embedded mode of running kill tasks on the Overlord itself.

Solution: Embedded kill tasks

These embedded tasks

  • kill unused segments as soon as they become eligible for kill.
  • run on the Overlord and do not take up task slots.
  • finish faster as a separate task process is not involved.
  • kill a small number of segments per task, to ensure that locks on an interval are not held for too long.
  • skip locked intervals to avoid head-of-line blocking in kill tasks.
  • do not require any configuration.
  • can keep up with a large number of unused segments in the cluster.

Items for follow up PR

  • Add integration tests
  • Docs changes

Design

Most of the heavy lifting of kill tasks is already done by the Overlord via task actions.
Moving the kill of segments to the Overlord avoids launching tasks unnecessarily, keeping
more task slots available, and also reduces inter-process communication. The Overlord can
also leverage the segment metadata cache for several of its operations, further improving
performance.

The only new responsibility added to the Overlord is deleting segment files from deep storage.

Implementation

  • Run a dedicated thread on the Overlord to kill segments as soon as they become eligible.
    • This avoids the need to define any kill period or kill task slot ratio.
  • Iterate over all eligible unused segment intervals of all datasources.
  • Use intervals separately and do not combine them into umbrella intervals.
  • Launch a kill task for each eligible interval and kill only up to 1000 segments in that interval.
    This helps ensure that locks are not held on any interval for too long.
  • Keep only two configs to simplify operation
    • druid.manager.segments.killUnused.enabled
    • druid.manager.segments.killUnused.bufferPeriod
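For reference, enabling the feature then reduces to just these two runtime properties on the Overlord. A minimal sketch of the snippet (the values shown are illustrative, not recommendations):

```properties
# Enable embedded kill tasks on the Overlord
druid.manager.segments.killUnused.enabled=true

# Retain unused segments for this period before they become eligible for kill
# (illustrative value)
druid.manager.segments.killUnused.bufferPeriod=P30D
```

Note that, per the validation discussed in the review thread below, enabling this also requires the incremental segment metadata cache to be in use on the Overlord.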

Changes

Main classes to review

  • UnusedSegmentKiller: OverlordDuty that launches embedded kill tasks
  • UnusedSegmentKillerConfig with the following fields:
    • enabled: Turns on segment killer on the Overlord
    • bufferPeriod: Period for which segments are retained even after being marked as unused.
  • EmbeddedKillTask: extends KillUnusedSegmentsTask to modify some behaviour
  • KillTaskToolbox: simplified version of TaskToolbox to run embedded kill tasks

Other changes

  • Make minor changes to KillUnusedSegmentsTask to modify some behaviour for embedded tasks
  • Add methods to retrieve unused segments and intervals to IndexerMetadataStorageCoordinator
    and SqlSegmentsMetadataQuery
  • Add short-circuit to TaskLockbox.remove and make it idempotent
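The TaskLockbox change above is essentially a guard clause at the top of remove. A minimal self-contained sketch of the idea, with made-up names (not the actual Druid code):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of an idempotent remove() with a short-circuit:
// if the task is not active, there is nothing to clean up, so repeated
// calls become harmless no-ops.
class LockboxSketch
{
  private final Set<String> activeTasks = new HashSet<>();

  void add(String taskId)
  {
    activeTasks.add(taskId);
  }

  /** Returns true if cleanup actually ran, false if it short-circuited. */
  boolean remove(String taskId)
  {
    if (!activeTasks.contains(taskId)) {
      return false; // short-circuit: task already removed or never added
    }
    activeTasks.remove(taskId);
    // ... release locks held by the task, clean up pending segments, etc.
    return true;
  }
}
```

The short-circuit matters because the embedded kill flow may call remove for tasks that were already cleaned up, and a second call must not repeat (or fail on) the cleanup.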

Follow up changes not included in this PR

Currently, a kill task logs an info message for every file removed from the deep storage.
For example, on S3:

log.info(
    "Removing from bucket: [%s] the following index files: [%s] from s3!",
    s3Bucket,
    keysToDeleteStrings
);

While this is okay for a normal kill task, it becomes too verbose when running embedded kill tasks on the Overlord.
For embedded kill tasks, we should log only warnings or errors for paths that could not be deleted or were skipped for some reason.

This would require the following changes:

  • Update DataSegmentKiller interface to return details of deleted or skipped paths
  • Log results of DataSegmentKiller.kill in KillUnusedSegmentsTask
  • Suppress these logs in the embedded kill task
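A rough sketch of what such a result type could look like; these names are purely hypothetical (DataSegmentKiller.kill does not return this today):

```java
import java.util.List;

// Hypothetical result type for a segment-kill operation, letting the caller
// choose the log level: an embedded kill task could log only the skipped or
// failed paths, while a normal kill task logs everything at INFO.
record SegmentKillResult(List<String> deletedPaths, List<String> skippedPaths)
{
  boolean hasSkippedPaths()
  {
    return !skippedPaths.isEmpty();
  }
}
```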

Release note

Add an embedded mode for running kill tasks for unused segments on the Overlord itself.

Advantages of embedded kill tasks

  • Kill unused segments as soon as they become eligible for kill.
  • Do not waste task slots on kill tasks
  • Embedded kill tasks finish faster
  • Skip locked intervals to avoid head-of-line blocking in kill tasks
  • No config required, just enable the feature
  • Able to keep up with a large number of unused segments in the cluster

New metrics

| Metric | Description | Dimensions |
| --- | --- | --- |
| segment/killed/metadataStore/count | Number of segments permanently deleted from the metadata store | dataSource, taskId, taskType, groupId, tags |
| segment/killed/deepStorage/count | Number of segments permanently deleted from deep storage | dataSource, taskId, taskType, groupId, tags |
| segment/kill/queueReset/time | Time taken to reset the kill queue on the Overlord. Emitted only if druid.manager.segments.killUnused.enabled is true. | |
| segment/kill/queueProcess/time | Time taken to fully process all the jobs in the kill queue on the Overlord. Emitted only if druid.manager.segments.killUnused.enabled is true. | |
| segment/kill/jobsProcessed/count | Number of jobs processed from the kill queue on the Overlord. Emitted only if druid.manager.segments.killUnused.enabled is true. | |
| segment/kill/skippedIntervals/count | Number of intervals skipped from kill due to being already locked. Emitted only if druid.manager.segments.killUnused.enabled is true. | dataSource, taskId |

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

kfaraz changed the title from "[WIP] Add embedded kill tasks that run on the Overlord" to "Add embedded kill tasks that run on the Overlord" on May 21, 2025
FrankChen021 requested a review from Copilot on May 23, 2025 08:49
Copilot AI left a comment

Pull Request Overview

Adds embedded segment kill functionality to the Overlord to improve performance and reduce task-slot usage by running kill operations in-process rather than spawning separate tasks.

  • Extended metadata queries (SqlSegmentsMetadataQuery, IndexerSQLMetadataStorageCoordinator, IndexerMetadataStorageCoordinator) to fetch unused segments and intervals with new filter and paging methods.
  • Enhanced configuration (SegmentsMetadataManagerConfig) to include a new killUnused section with validation, and updated core interfaces to return deletion counts.
  • Introduced in-process kill tasks (UnusedSegmentsKiller, KillTaskToolbox, EmbeddedKillTask) and updated scheduling (OverlordDutyExecutor), locking (TaskLockbox), and reporting (KillTaskReport, TaskMetrics).

Reviewed Changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java | Added methods to query unused segments and intervals; improved docs |
| server/src/main/java/org/apache/druid/metadata/SegmentsMetadataManagerConfig.java | Added killUnused config field and validation in constructor |
| server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java | Proxy methods for new SQL queries; updated deleteSegments to return count |
| server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java | Updated interface to support new kill and interval methods |
| processing/src/main/java/org/apache/druid/indexing/overlord/report/KillTaskReport.java | Changed getPayload() return type from Object to Stats |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java | Made remove(...) idempotent and short-circuit when appropriate |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/duty/KillTaskToolbox.java | New toolbox class for embedded kill tasks |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/duty/OverlordDutyExecutor.java | Use simple class name for logs and skip zero-period duties |
Comments suppressed due to low confidence (2)

processing/src/main/java/org/apache/druid/indexing/overlord/report/KillTaskReport.java:59

  • Changed getPayload() return type from Object to Stats, which may break existing consumers expecting an Object. Consider adding a deprecated public Object getPayload() forwarder or otherwise preserving backward compatibility.
public Stats getPayload()

indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java:1235

  • Returning early in remove(...) skips cleanupUpgradeAndPendingSegments and unlockAll, which can leave segment locks or pending state uncleared. Ensure cleanup and unlock logic always executes before returning.
if (!activeTasks.contains(task.getId())) {

* @param maxUpdatedTime Returned segments must have a {@code used_status_last_updated}
* which is either null or earlier than this value.
*/
public List<DataSegment> findUnusedSegments(
Copilot AI, May 23, 2025:

The Javadoc for findUnusedSegments does not include a @param dataSource description. Please add a @param dataSource entry to accurately document the parameter.

@FrankChen021 (Member) commented:

I have not reviewed the changes carefully yet, but let me leave one question first: some kill tasks might take a very long time (for example, up to dozens of minutes) to delete segments for a large datasource. If several such kill tasks run together, will they block the Overlord's other duties?

@kfaraz (Author) commented May 24, 2025

I have not review the changes carefully, but leave one question first, some kill tasks might take very long time(for example, up to dozens of minutes) to delete segments for large data source, if there're several such kill tasks run together, wil these kill tasks block the overlord's other duties?

Thanks for the comment, @FrankChen021 !

A couple of clarifications on the above:

  • When embedded kill tasks are running on the Overlord, it is recommended to NOT launch kill tasks
    manually or from the coordinator duty. The current implementation does not do any validation around
    this. But we can perhaps give a warning message when submitting a normal kill task.
  • Embedded kill tasks run on a single thread on the Overlord. So, there can be only one.
  • Since it is a dedicated thread, it does not interfere with other Overlord activities.
  • It cannot take a long time even for large datasources since each embedded kill task will
    only ever kill at most 1000 segments, all having the same start and end time.
    So the total time would be just the time taken to delete 1000 segments from metadata store and deep storage.
  • Additionally, if a single kill task still somehow gets stuck, there is a management ping sent from OverlordDutyExecutor
    that will cancel that task and move on to the next one.

Thanks for bringing this up, I will also add the above points to the PR description (and perhaps some points
to the code javadoc as well) to help future readers.

@capistrant (Contributor) commented:

@kfaraz thanks for the detailed PR description and answer to Frank's question! I plan to review in full today (05/28).

@capistrant (Contributor) left a comment

I ran out of time before a hard stop at the end of my day so I wasn't able to complete a full review. Publishing my pending comments for now and will finish reviewing tomorrow. I'm overall supportive of the design and think this will be cool.

updateStateIfNewLeader();
if (shouldResetKillQueue()) {
// Clear the killQueue to stop further processing of already queued jobs
killQueue.clear();
capistrant (Contributor):
Is this duplicative, since resetKillQueue() also calls clear()?

kfaraz (Author):
This has two benefits:

  • We clear the queue proactively thus stopping further processing right here.
  • There is no interleaving between submissions of startNextJobInKillQueue and resetKillQueue i.e.
    at any given point, the exec either has a single startNextJobInKillQueue or a single resetKillQueue
    in its executor queue. This keeps the semantics simple and easy to debug and reason about.
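The "at most one pending runnable" invariant described in this reply can be illustrated with a small self-contained sketch (all names here are made up; this is not the actual UnusedSegmentsKiller code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Each job schedules at most one successor on a single-threaded executor,
// so the executor queue never holds more than one pending runnable, and a
// queued reset can never interleave with a queued "start next job".
class KillQueueSketch
{
  private final ExecutorService exec = Executors.newSingleThreadExecutor();
  private final PriorityBlockingQueue<String> killQueue = new PriorityBlockingQueue<>();
  private final CompletableFuture<Void> drained = new CompletableFuture<>();
  final AtomicInteger processedJobs = new AtomicInteger();

  void submitJobs(Iterable<String> jobs)
  {
    jobs.forEach(killQueue::add);
    exec.submit(this::startNextJob); // single entry point into the chain
  }

  private void startNextJob()
  {
    final String job = killQueue.poll();
    if (job == null) {
      drained.complete(null); // queue empty: nothing left pending on the executor
      return;
    }
    processedJobs.incrementAndGet(); // stand-in for running one embedded kill task
    exec.submit(this::startNextJob); // exactly one successor is ever queued
  }

  void awaitDrained()
  {
    drained.join();
    exec.shutdown();
  }
}
```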

"Failed while processing kill jobs. There are [%d] jobs in queue.",
killQueue.size()
);
startNextJobInKillQueue();
capistrant (Contributor):
Could you explain the reasoning behind this calling itself on a caught exception? I'm trying to understand what kind of risks there are in doing this, if any. Could poll continuously throw an exception, leaving us in an infinite loop since the queue will never be drawn down to hit the isEmpty conditional? Even if it is legitimate, a short comment could help folks who are confused by it like I am.

kfaraz (Author), May 29, 2025:

Thanks for pointing this out!

I wanted to continue with the next job in case a specific kill task throws an exception.
But we are already handling that in runKillTask.

I have updated this.

{
if (isEnabled()) {
// Check every hour that the kill queue is being processed normally
return new DutySchedule(Duration.standardHours(1).getMillis(), Duration.standardMinutes(1).getMillis());
capistrant (Contributor):

Should this be configurable? I ask only because, as I interpret this duty, if it runs through the 1k kill queue in less than 1 hour, won't it be idle until the next run? If the metric for queue processing duration is constantly under an hour for a cluster, the operator may want to increase the frequency.

kfaraz (Author), May 29, 2025:

My reasoning was as follows:

Case 1: We are able to clear the queue within an hour.
This implies that there are not too many unused segments in the cluster anyway and we are in a good place.
Chances are the next reset will not have a lot of new unused segments to kill off.
Also note that resetting the queue every hour is still much more frequent than the current Coordinator duty
setup: even though that duty would typically run every 30 mins or so, it would queue up only a
handful of kill tasks (limited by task slots).

Case 2: Clearing the queue takes longer than an hour
In this case, we already have our hands full and checking every hour is frequent enough.

Note: 1k is only the initial size of the queue, it can have more elements in practice.
I have added a UT that launches a large number of kill tasks to clarify this.

capistrant (Contributor):

thanks for the info, makes sense

if (isEnabled()) {
this.exec = executorFactory.create(1, "UnusedSegmentsKiller-%s");
this.killQueue = new PriorityBlockingQueue<>(
1000,
capistrant (Contributor):

Is this kept non-configurable for operational simplicity?

kfaraz (Author):

Yes.

This is only the initial capacity of the queue, it can scale up to accommodate more entries as necessary.

capistrant (Contributor):

:doh: I shoulda known this is just initial capacity

kfaraz (Author):

😃

@abhishekrb19 (Contributor) left a comment

Thanks for this feature, @kfaraz! +1 to the idea as it's a clear step up from the coordinator-based kill duty for the reasons you mentioned. I’ve left a few comments on the implementation.

Do you think it makes sense to eventually deprecate the coordinator-based duty in favor of the Overlord one? On a similar note, are there other duties that the Overlord can start taking ownership of when the incremental segment cache is enabled?

this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER);
this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null));
if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) {
throw InvalidInput.exception(
abhishekrb19 (Contributor):

  1. Please include the runtime properties in the error message to make corrective action easier; something like Please set "druid.manager.segments.useIncrementalCache=true" when ....
  2. Should the target persona in this case be an operator rather than an end user?

kfaraz (Author):

Updated.

Comment on lines +316 to +318
catch (IOException e) {
throw DruidException.defensive(e, "Error while reading unused segments");
}
abhishekrb19 (Contributor):

Hmm, I don't think an IOException should be a developer-facing defensive exception. I think throwing a RuntimeException or an equivalent DruidException would work better.

kfaraz (Author):

Fixed.

Comment on lines +52 to +53
* If this returns null, segments are never killed by the {@code UnusedSegmentKiller}
* but they might still be killed by the Coordinator.
abhishekrb19 (Contributor):

Can this actually be null, since the default in the constructor is set to 90 days?

kfaraz (Author):

Fixed.

this.pollDuration = Configs.valueOrDefault(pollDuration, Period.minutes(1));
this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER);
this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null));
if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) {
abhishekrb19 (Contributor):

I was wondering about how this overlord embedded kill tasks feature would interact with the coordinator kill duty. @kfaraz I see your comment on this:

When embedded kill tasks are running on the Overlord, it is recommended to NOT launch kill tasks
manually or from the coordinator duty. The current implementation does not do any validation around
this. But we can perhaps give a warning message when submitting a normal kill task.

I think having a validation makes sense and we can:

  • Fail fast if both the kill features on the Overlord and Coordinator are enabled
  • Log a warning in the kill task as you mention

For the validation, is it possible to bind the coordinator kill duty config so that the Overlord knows about it (or vice-versa)?

capistrant (Contributor):

For the validation, is it possible to bind the coordinator kill duty config so that the Overlord knows about it (or vice-versa)?

If the Coordinator and Overlord are different processes that may or may not be on the same server and may or may not use the same config files, wouldn't this be tough to do reliably?

kfaraz (Author):

Yes, config-based validation would not be possible as the two can be separate processes.
The only validation we can do is not accept kill tasks submitted by the coordinator to the Overlord
(they would have the prefix coordinator-issued-kill). It might feel a bit hacky but it's the cleanest
option right now (short of exposing an API to read the config, which is overkill).

Also, as @capistrant points out, there is no inherent harm in running the two kill modes together
since each kill task (both embedded and normal) acquires an EXCLUSIVE lock on the interval.
Embedded kill tasks don't take up a task slot, so that should not be a concern either.

Also, users are always allowed to submit kill tasks manually. So several kill tasks for the same datasource
can be running concurrently anyway.

@capistrant , @abhishekrb19 , please let me know if we should add validation to not accept kill tasks submitted by the Coordinator if embedded kill is enabled.

Contributor:

I don't love the idea of rejecting tasks based on prefix. I think that since it is not a risk to the correctness/health of the underlying cluster, we should leave it be with good documentation that we suggest updating coordinator configs if using the new embedded kill.

kfaraz (Author):

Fair enough, I will add some docs to this PR.

Contributor:

Yeah, documenting it seems enough in that case.

this.killConfig = config.getKillUnused();

if (isEnabled()) {
this.exec = executorFactory.create(1, "UnusedSegmentsKiller-%s");
abhishekrb19 (Contributor):

Given that this is a new feature, I think having an info log here is useful when this duty is enabled on the Overlord

kfaraz (Author):

Added, thanks for the suggestion!


public static final String RUN_DURATION = "task/run/time";

public static final String NUKED_SEGMENTS = "segment/killed/metadataStore/count";
abhishekrb19 (Contributor):

To avoid ambiguity and for consistency, consider renaming this to SEGMENTS_DELETED_FROM_METADATA_STORE or similar

Comment on lines 416 to 420
LOG.warn(
"Skipping kill of segments[%s] as its load spec is also used by segment IDs[%s].",
parentIdToUnusedSegments.get(parent), children
);
parentIdToUnusedSegments.remove(parent);
abhishekrb19 (Contributor):

If this is expected during normal operations when concurrent append/replace is enabled, should this be logged at the info level instead of warn? We could also consider emitting a "skipped" metric with the reason as a dimension, if you think this is useful to track.

kfaraz (Author):

Is it okay to postpone the skip metric for now?

I intend to touch this part of the code anyway in a follow up PR for log improvements in the DataSegmentKiller interface (details in the PR description above).

{
boolean isPresent = usedSegmentLoadSpecs.contains(segment.getLoadSpec());
if (isPresent) {
LOG.warn("Skipping kill of segment[%s] as its load spec is also used by other segments.", segment);
abhishekrb19 (Contributor):

Same comment re warn vs info

@Rule
public TaskActionTestKit taskActionTestKit = new TaskActionTestKit();

private static final List<DataSegment> WIKI_SEGMENTS_1X10D =
abhishekrb19 (Contributor):

The 1X10D in the variable got me thinking this was a roman numeral of sorts :) I see we use this convention in a few other places too

kfaraz (Author):

Yeah, it is meant to indicate 1 partition x 10 days.
Any other nomenclature seemed too verbose, so I went ahead with this.

Comment on lines +64 to +67
* {@link OverlordDuty} to delete unused segments from metadata store and the
* deep storage. Launches {@link EmbeddedKillTask}s to clean unused segments
* of a single datasource-interval.
*
abhishekrb19 (Contributor):

I think it would be good to cross-link the coordinator-based KillUnusedSegments duty in this class and vice-versa.

kfaraz (Author):

Done.

@capistrant (Contributor) left a comment

Looking good to me. I left just a couple more small comments of my own. I think @abhishekrb19 left a review with good comments that I won't individually plus-one, but in general I support them.

}

@Test
public void test_run_killsSegmentUpdatedInFuture_ifBufferPeriodIsNegative()
capistrant (Contributor):

Is there a legitimate use case for a negative buffer period? Glad you added a test for it, but it feels weird.

kfaraz (Author):

Yeah, I don't recall why I had added it myself 😅, segment update times cannot be in the future and it doesn't make sense to have a negative buffer period anyway. Removing it.

)
{
this.enabled = Configs.valueOrDefault(enabled, false);
this.bufferPeriod = Configs.valueOrDefault(bufferPeriod, Period.days(90));
Contributor:

Any reason for picking 90 days as default? I see that the coordinator config for the same is 30 days per https://druid.apache.org/docs/latest/configuration/ ... As long as it is documented, I don't have any preference between the two but was curious.

kfaraz (Author):

Seems like a typo, fixing it.

@capistrant (Contributor) commented:

  • When embedded kill tasks are running on the Overlord, it is recommended to NOT launch kill tasks
    manually or from the coordinator duty. The current implementation does not do any validation around
    this. But we can perhaps give a warning message when submitting a normal kill task.

What are the worst-case side effects if a user has both coordinator and overlord kill running, or is submitting kills to the coordinator while overlord kill is enabled? It feels like it will be hard to ensure Druid users aren't doing this. We can heavily document the fact that it shouldn't be done. But short of having some centralized gate for who can do kills (metastore, ZK, or something), I don't see how we can prevent it as long as the ability to enable killing for both services exists at the same time.

@kfaraz (Author) commented May 30, 2025

@capistrant , @abhishekrb19 , thanks for the thorough review!
I have tried to address your feedback. Please let me know if any further change is required.

@gianm (Contributor) commented May 30, 2025

@kfaraz There is a follow up item listed in the description to limit logging to deletion failures/skips only:

While this is okay for a normal kill task, it becomes too verbose when running embedded kill tasks on the Overlord.
For embedded kill tasks, we should log only warnings or errors for paths that could not be deleted or were skipped for some reason.

There should be some message that is logged at INFO level for each segment that is deleted. Deleting a segment is a destructive operation that in some situations is not undoable. Operators need some logs when that happens, for situations where they wonder what happened to their segment files.

Ideally, there is exactly one message logged at INFO level for each segment (not zero, not more than one). Ideally that one message has both the segment ID and the storage location (S3/GCS/Azure/etc).

@kfaraz (Author) commented May 31, 2025

That's a fair point, @gianm. In that case, the current logging done in the respective DataSegmentKiller impls should suffice.
If needed, we can improve upon them in follow up PRs.

With embedded kill tasks, the only drawback would be that these segment delete messages would flood the Overlord logs.
But I guess operators can always filter those out based on the logger name.

Another option could be to direct the task logs to a different log file, same as regular task logs
and back them up on deep storage. Overlord would then just log a summary of the task and not all the details.

Let me know your thoughts.

final Stopwatch resetDuration = Stopwatch.createStarted();
try {
  killQueue.clear();
  if (!leaderSelector.isLeader()) {
Contributor

did you mean to add an early return in this new conditional block to prevent queue re-build?

Contributor Author

Ah, thanks for catching it!
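The fix discussed above can be sketched as an early return. This is a minimal stand-in, assuming simplified `LeaderSelector` and queue types rather than the actual classes in the PR:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of the fix discussed above: return early when this Overlord
// is not the leader, so the kill queue is never rebuilt on a non-leader.
// LeaderSelector and the queue contents are simplified stand-ins for the
// types in the PR.
public class KillQueueReset {
  interface LeaderSelector {
    boolean isLeader();
  }

  private final Queue<String> killQueue = new ArrayDeque<>();
  private final LeaderSelector leaderSelector;

  KillQueueReset(LeaderSelector leaderSelector) {
    this.leaderSelector = leaderSelector;
  }

  void resetQueue() {
    killQueue.clear();
    if (!leaderSelector.isLeader()) {
      // Early return: only the leader should rebuild the queue.
      return;
    }
    killQueue.offer("kill-candidate");  // placeholder for the real rebuild
  }

  int queueSize() {
    return killQueue.size();
  }

  public static void main(String[] args) {
    KillQueueReset leader = new KillQueueReset(() -> true);
    leader.resetQueue();
    System.out.println("leader queue size: " + leader.queueSize());

    KillQueueReset follower = new KillQueueReset(() -> false);
    follower.resetQueue();
    System.out.println("follower queue size: " + follower.queueSize());
  }
}
```

With the early return in place, a follower's queue stays empty after a reset, while the leader's queue is rebuilt.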

this.pollDuration = Configs.valueOrDefault(pollDuration, Period.minutes(1));
this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER);
this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null));
if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) {
Contributor

I don't love the idea of rejecting tasks based on prefix. Since it is not a risk to the correctness or health of the underlying cluster, I think we should leave it be, with good documentation suggesting that users update Coordinator configs when using the new embedded kill.

).build()
);

final Set<Interval> sampleIntervals = intervals.stream().limit(5).collect(Collectors.toSet());
Contributor

@capistrant capistrant Jun 2, 2025

Will this change to logging a sample of 5 intervals potentially have to revert to the full interval set, depending on what we decide about Gian's note on all deletes being auditable in logs?

Contributor Author

The segment IDs would already be logged by the tasks, but I guess we can fix this up too.
Especially since the number of intervals is typically low.
Coordinator and Overlord always launch kill tasks for a single interval.
Tasks submitted manually are likely to have few intervals too.

@kfaraz
Contributor Author

kfaraz commented Jun 3, 2025

@capistrant , I have addressed the latest comments. Is it okay to keep the docs changes for a follow up PR?

Contributor

@abhishekrb19 abhishekrb19 left a comment

Lgtm, thanks!

this.pollDuration = Configs.valueOrDefault(pollDuration, Period.minutes(1));
this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER);
this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null));
if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) {
Contributor

Yeah, documenting it seems enough in that case.

Comment on lines +239 to +247
final Set<String> dataSources = storageCoordinator.retrieveAllDatasourceNames();

final Map<String, Integer> dataSourceToIntervalCounts = new HashMap<>();
for (String dataSource : dataSources) {
  storageCoordinator.retrieveUnusedSegmentIntervals(dataSource, MAX_INTERVALS_TO_KILL_IN_DATASOURCE).forEach(
      interval -> {
        dataSourceToIntervalCounts.merge(dataSource, 1, Integer::sum);
        killQueue.offer(new KillCandidate(dataSource, interval));
      }
Contributor

@abhishekrb19 abhishekrb19 Jun 3, 2025

For fairness, the coordinator duty uses a CircularList to round robin through the kill datasources. But for the overlord duty, since there's no constraint on task slots, it would potentially process all of them in a deterministic order. If there are many datasources with large numbers of unused segment intervals hitting the ~10K threshold, the ones later in the list could end up consistently deprioritized?

We can revisit this logic if needed, just trying to understand the queuing characteristics of this current logic.

Contributor Author

The kill queue prioritizes older intervals. So, once the intervals of a datasource become old enough, they will become top priority.

Contributor

Ah, I see, thanks for the clarification.
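The prioritization described in this thread can be sketched with a priority queue ordered by interval start, so the oldest unused intervals win regardless of datasource. This is a simplified stand-in, assuming a hypothetical `KillCandidate` shape rather than the actual class in the PR:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of the ordering described above: kill candidates are prioritized by
// interval start, so the oldest unused intervals across all datasources are
// processed first. KillCandidate is a simplified, hypothetical stand-in.
public class KillQueueOrdering {
  record KillCandidate(String dataSource, String intervalStart) {}

  public static void main(String[] args) {
    // ISO-style date strings compare correctly in lexicographic order.
    PriorityQueue<KillCandidate> killQueue = new PriorityQueue<>(
        Comparator.comparing(KillCandidate::intervalStart)
    );
    killQueue.offer(new KillCandidate("wiki", "2024-01-01"));
    killQueue.offer(new KillCandidate("logs", "2020-01-01"));
    killQueue.offer(new KillCandidate("wiki", "2022-01-01"));

    // The 2020 interval from "logs" is polled first, even though the "wiki"
    // entries were added earlier: ordering depends only on interval age.
    KillCandidate first = killQueue.poll();
    System.out.println(first.dataSource() + " " + first.intervalStart());
  }
}
```

Under this ordering, a datasource late in the iteration order cannot be starved indefinitely: as its intervals age, they rise to the top of the queue.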

Contributor

@capistrant capistrant left a comment

🚀

@gianm
Contributor

gianm commented Jun 3, 2025

> That's a fair point, @gianm . In that case, the current logging done by the respective DataSegmentKiller impls should suffice. If needed, we can improve upon them in follow up PRs.
>
> With embedded kill tasks, the only drawback would be that these segment delete messages would flood the Overlord logs. But I guess operators can always filter those out based on the logger name.
>
> Another option could be to direct the task logs to a different log file, same as regular task logs and back them up on deep storage. Overlord would then just log a summary of the task and not all the details.
>
> Let me know your thoughts.

I'm not really worried about there being too many logs. I think of this log message as part of the segment lifecycle: first a segment is allocated (if in append mode), then published (always), then marked unused (if no longer needed), then deleted (if killed). So these kill logs shouldn't be "flooding", in terms of volume, any more than logs related to allocation and publish are flooding. I think we already have at least one log message for each of those actions.

@kfaraz
Contributor Author

kfaraz commented Jun 3, 2025

Thanks for the clarification, @gianm !

@kfaraz kfaraz merged commit 3be94a0 into apache:master Jun 3, 2025
72 of 75 checks passed
@kfaraz
Contributor Author

kfaraz commented Jun 3, 2025

Thanks for the reviews, @abhishekrb19 , @capistrant !

I have merged the PR since the IT failures are unrelated and are already being fixed in another PR #18067 .

@kfaraz kfaraz deleted the embedded_kill_tasks branch June 3, 2025 18:48
@capistrant
Contributor

@kfaraz with a release window coming up for Druid, let's make sure we get this documented soon so folks know how to use it in Druid 34, should they choose to.

@kfaraz
Contributor Author

kfaraz commented Jun 11, 2025

Absolutely, @capistrant , thanks for the reminder!
I will put up a PR later this week.

@kfaraz
Contributor Author

kfaraz commented Jun 11, 2025

@capistrant , I have created #18124 for docs changes.

kfaraz added a commit that referenced this pull request Jun 12, 2025
Docs changes for #18028

- Document metrics and configs for embedded kill tasks
- Remove duplicate configs for Coordinator auto-kill from `data-management/delete.md`
- Fix up references
jtuglu1 pushed a commit to jtuglu1/druid that referenced this pull request Jun 17, 2025
Docs changes for apache#18028

- Document metrics and configs for embedded kill tasks
- Remove duplicate configs for Coordinator auto-kill from `data-management/delete.md`
- Fix up references
@capistrant capistrant added this to the 34.0.0 milestone Jul 22, 2025