Add embedded kill tasks that run on the Overlord #18028
Conversation
Pull Request Overview
Adds embedded segment kill functionality to the Overlord to improve performance and reduce task-slot usage by running kill operations in-process rather than spawning separate tasks.
- Extended metadata queries (`SqlSegmentsMetadataQuery`, `IndexerSQLMetadataStorageCoordinator`, `IndexerMetadataStorageCoordinator`) to fetch unused segments and intervals with new filter and paging methods.
- Enhanced configuration (`SegmentsMetadataManagerConfig`) to include a new `killUnused` section with validation, and updated core interfaces to return deletion counts.
- Introduced in-process kill tasks (`UnusedSegmentsKiller`, `KillTaskToolbox`, `EmbeddedKillTask`) and updated scheduling (`OverlordDutyExecutor`), locking (`TaskLockbox`), and reporting (`KillTaskReport`, `TaskMetrics`).
Reviewed Changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
Summary per file:
| File | Description |
|---|---|
| server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java | Added methods to query unused segments and intervals; improved docs |
| server/src/main/java/org/apache/druid/metadata/SegmentsMetadataManagerConfig.java | Added killUnused config field and validation in constructor |
| server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java | Proxy methods for new SQL queries; updated deleteSegments to return count |
| server/src/main/java/org/apache/druid/indexing/overlord/IndexerMetadataStorageCoordinator.java | Updated interface to support new kill and interval methods |
| processing/src/main/java/org/apache/druid/indexing/overlord/report/KillTaskReport.java | Changed getPayload() return type from Object to Stats |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java | Made remove(...) idempotent and short-circuit when appropriate |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/duty/KillTaskToolbox.java | New toolbox class for embedded kill tasks |
| indexing-service/src/main/java/org/apache/druid/indexing/overlord/duty/OverlordDutyExecutor.java | Use simple class name for logs and skip zero-period duties |
Comments suppressed due to low confidence (2)
processing/src/main/java/org/apache/druid/indexing/overlord/report/KillTaskReport.java:59
- Changed `getPayload()` return type from `Object` to `Stats`, which may break existing consumers expecting an `Object`. Consider adding a deprecated `public Object getPayload()` forwarder or otherwise preserving backward compatibility.
public Stats getPayload()
indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java:1235
- Returning early in `remove(...)` skips `cleanupUpgradeAndPendingSegments` and `unlockAll`, which can leave segment locks or pending state uncleared. Ensure the cleanup and unlock logic always executes before returning.
if (!activeTasks.contains(task.getId())) {
| * @param maxUpdatedTime Returned segments must have a {@code used_status_last_updated} | ||
| * which is either null or earlier than this value. | ||
| */ | ||
| public List<DataSegment> findUnusedSegments( |
The Javadoc for findUnusedSegments does not include a @param dataSource description. Please add a @param dataSource entry to accurately document the parameter.
|
I have not reviewed the changes carefully, but I'll leave one question first: some kill tasks might take a very long time (for example, up to dozens of minutes) to delete segments for a large datasource. If several such kill tasks run together, will these kill tasks block the Overlord's other duties?

Thanks for the comment, @FrankChen021 ! A couple of clarifications on the above:
Thanks for bringing this up, I will also add the above points to the PR description (and perhaps some points

@kfaraz thanks for the detailed PR description and answer to Frank's question! I plan to review in full today (05/28)
capistrant
left a comment
I ran out of time before a hard stop at the end of my day so I wasn't able to complete a full review. Publishing my pending comments for now and will finish reviewing tomorrow. I'm overall supportive of the design and think this will be cool.
| updateStateIfNewLeader(); | ||
| if (shouldResetKillQueue()) { | ||
| // Clear the killQueue to stop further processing of already queued jobs | ||
| killQueue.clear(); |
Is this duplicative, since `resetKillQueue()` also calls `clear()`?
This has two benefits:
- We clear the queue proactively, thus stopping further processing right here.
- There is no interleaving between submissions of `startNextJobInKillQueue` and `resetKillQueue`, i.e. at any given point, the `exec` has either a single `startNextJobInKillQueue` or a single `resetKillQueue` in its executor queue. This keeps the semantics simple and easy to debug and reason about.
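The non-interleaving guarantee described above comes from funneling both operations through a single-threaded executor, which runs submitted jobs one at a time in submission order. A minimal standalone sketch (class and method names are illustrative, not the PR's actual code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadDutyDemo {
    // A single-threaded executor runs submitted jobs one at a time, in submission
    // order, so a "reset" and a "process next" can never interleave.
    public static List<String> runDuties() {
        List<String> log = new CopyOnWriteArrayList<>();
        ExecutorService exec = Executors.newSingleThreadExecutor();
        exec.submit(() -> log.add("startNextJobInKillQueue"));
        exec.submit(() -> log.add("resetKillQueue"));
        exec.submit(() -> log.add("startNextJobInKillQueue"));
        exec.shutdown();
        try {
            exec.awaitTermination(10, TimeUnit.SECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return log;
    }
}
```

Because there is exactly one worker thread, the observed order is always the submission order, with no overlap between jobs.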
| "Failed while processing kill jobs. There are [%d] jobs in queue.", | ||
| killQueue.size() | ||
| ); | ||
| startNextJobInKillQueue(); |
Could you explain the reasoning behind this calling itself on a caught exception? I'm trying to understand what kind of risks there are in doing this, if any. Could poll continuously throw an exception, leaving us in an infinite loop, since the queue will never be drawn down to hit the isEmpty conditional? Even if it is legit, a short comment could help folks who are confused by it like I am.
Thanks for pointing this out!
I wanted to continue with the next job in case a specific kill task throws an exception.
But we are already handling that in runKillTask.
I have updated this.
| { | ||
| if (isEnabled()) { | ||
| // Check every hour that the kill queue is being processed normally | ||
| return new DutySchedule(Duration.standardHours(1).getMillis(), Duration.standardMinutes(1).getMillis()); |
Should this be configurable? I ask only because, as I interpret this duty, if it runs through the 1k kill queue in less than an hour, won't it be idle until the next run? If the metric for queue-processing duration is constantly under an hour for a cluster, the operator may want to increase the frequency.
My reasoning was as follows:
Case 1: We are able to clear the queue within an hour
This implies that there are not too many unused segments in the cluster anyway and we are in a good place.
Chances are next reset will not have a lot of new unused segments to kill off anyway.
Also note that resetting the queue every hour is still much more frequent than the current Coordinator duty
setup, since even though that duty would typically run every 30 mins or so, it would queue up only a
handful of kill tasks (limited by task slots).
Case 2: Clearing the queue takes longer than an hour
In this case, we already have our hands full and checking every hour is frequent enough.
Note: 1k is only the initial size of the queue, it can have more elements in practice.
I have added a UT that launches a large number of kill tasks to clarify this.
Thanks for the info, makes sense.
| if (isEnabled()) { | ||
| this.exec = executorFactory.create(1, "UnusedSegmentsKiller-%s"); | ||
| this.killQueue = new PriorityBlockingQueue<>( | ||
| 1000, |
Is this kept as a non-configurable for operational simplicity?
Yes.
This is only the initial capacity of the queue; it can grow to accommodate more entries as necessary.
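To illustrate that point: `PriorityBlockingQueue` is unbounded, and the capacity argument only sizes the backing array before the first resize. A small standalone sketch:

```java
import java.util.concurrent.PriorityBlockingQueue;

public class QueueCapacityDemo {
    public static int fillBeyondInitialCapacity() {
        // 1000 is only the initial capacity; PriorityBlockingQueue is unbounded
        // and grows its backing array as more elements are offered.
        PriorityBlockingQueue<Integer> killQueue = new PriorityBlockingQueue<>(1000);
        for (int i = 0; i < 5000; i++) {
            killQueue.offer(i);
        }
        return killQueue.size();
    }
}
```

`offer` never blocks or fails for capacity reasons here; only available memory bounds the queue.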
:doh: I shoulda known this is just initial capacity
abhishekrb19
left a comment
Thanks for this feature, @kfaraz! +1 to the idea as it's a clear step up from the coordinator-based kill duty for the reasons you mentioned. I’ve left a few comments on the implementation.
Do you think it makes sense to eventually deprecate the coordinator-based duty in favor of the Overlord one? On a similar note, are there other duties that the Overlord can start taking ownership of when the incremental segment cache is enabled?
| this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER); | ||
| this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null)); | ||
| if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) { | ||
| throw InvalidInput.exception( |
- Please include the runtime properties in the error message to make corrective action easier; something like `Please set "druid.manager.segments.useIncrementalCache=true" when ...`.
- Should the target persona in this case be an operator rather than an end user?
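A sketch of the suggested operator-friendly validation, spelling out the runtime property in the message (the method shape and message wording are illustrative, not the PR's actual code):

```java
public class KillConfigValidationDemo {
    // Returns an error message if the config combination is invalid, else null.
    public static String validate(boolean killUnusedEnabled, String useIncrementalCache) {
        if (killUnusedEnabled && "NEVER".equals(useIncrementalCache)) {
            // Name the runtime property explicitly so an operator knows what to change
            return "Set 'druid.manager.segments.useIncrementalCache' to a mode other than NEVER"
                   + " when 'druid.manager.segments.killUnused.enabled' is true.";
        }
        return null;
    }
}
```

Failing fast at startup with a message like this points the operator at the exact property to fix, rather than surfacing a generic error later.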
| catch (IOException e) { | ||
| throw DruidException.defensive(e, "Error while reading unused segments"); | ||
| } |
Hmm, I don't think an IOException should be a developer-facing defensive exception. I think throwing a RuntimeException or an equivalent DruidException would work better.
| * If this returns null, segments are never killed by the {@code UnusedSegmentKiller} | ||
| * but they might still be killed by the Coordinator. |
Can this actually be null, since the default in the constructor is set to 90 days?
| this.pollDuration = Configs.valueOrDefault(pollDuration, Period.minutes(1)); | ||
| this.useIncrementalCache = Configs.valueOrDefault(useIncrementalCache, SegmentMetadataCache.UsageMode.NEVER); | ||
| this.killUnused = Configs.valueOrDefault(killUnused, new UnusedSegmentKillerConfig(null, null)); | ||
| if (this.killUnused.isEnabled() && this.useIncrementalCache == SegmentMetadataCache.UsageMode.NEVER) { |
I was wondering about how this overlord embedded kill tasks feature would interact with the coordinator kill duty. @kfaraz I see your comment on this:
When embedded kill tasks are running on the Overlord, it is recommended to NOT launch kill tasks
manually or from the coordinator duty. The current implementation does not do any validation around
this. But we can perhaps give a warning message when submitting a normal kill task.
I think having a validation makes sense and we can:
- Fail fast if both the kill features on the Overlord and Coordinator are enabled
- Log a warning in the kill task as you mention
For the validation, is it possible to bind the coordinator kill duty config so that the Overlord knows about it (or vice-versa)?
For the validation, is it possible to bind the coordinator kill duty config so that the Overlord knows about it (or vice-versa)?
If coordinator and overlord are different processes who may or may not be on the same server and may or may not use the same config files, wouldn't this be tough to reliably do?
Yes, config-based validation would not be possible as the two can be separate processes.
The only validation we can do is not accept kill tasks submitted by the coordinator to the Overlord
(they would have the prefix coordinator-issued-kill). It might feel a bit hacky but it's the cleanest
option right now (short of exposing an API to read the config, which is overkill).
Also, as @capistrant points out, there is no inherent harm in running the two kill modes together
since each kill task (both embedded and normal) acquires an EXCLUSIVE lock on the interval.
Embedded kill tasks don't take up a task slot, so that should not be a concern either.
Also, users are always allowed to submit kill tasks manually. So several kill tasks for the same datasource
can be running concurrently anyway.
@capistrant , @abhishekrb19 , please let me know if we should add validation to not accept kill tasks submitted by the Coordinator if embedded kill is enabled.
I don't love the idea of rejecting tasks based on prefix. I think that since it is not a risk to the correctness/health of the underlying cluster, we should leave it be with good documentation that we suggest updating coordinator configs if using the new embedded kill.
Fair enough, I will add some docs to this PR.
Yeah, documenting it seems enough in that case.
| this.killConfig = config.getKillUnused(); | ||
| if (isEnabled()) { | ||
| this.exec = executorFactory.create(1, "UnusedSegmentsKiller-%s"); |
Given that this is a new feature, I think having an info log here is useful when this duty is enabled on the Overlord
Added, thanks for the suggestion!
| public static final String RUN_DURATION = "task/run/time"; | ||
| public static final String NUKED_SEGMENTS = "segment/killed/metadataStore/count"; |
To avoid ambiguity and for consistency, consider renaming this to SEGMENTS_DELETED_FROM_METADATA_STORE or similar
| LOG.warn( | ||
| "Skipping kill of segments[%s] as its load spec is also used by segment IDs[%s].", | ||
| parentIdToUnusedSegments.get(parent), children | ||
| ); | ||
| parentIdToUnusedSegments.remove(parent); |
If this is expected during normal operations when concurrent append/replace is enabled, should this be logged at the info level instead of warn? We could also consider emitting a "skipped" metric with the reason as a dimension, if you think this is useful to track.
Is it okay to postpone the skip metric for now?
I intend to touch this part of the code anyway in a follow up PR for log improvements in the DataSegmentKiller interface (details in the PR description above).
| { | ||
| boolean isPresent = usedSegmentLoadSpecs.contains(segment.getLoadSpec()); | ||
| if (isPresent) { | ||
| LOG.warn("Skipping kill of segment[%s] as its load spec is also used by other segments.", segment); |
Same comment re warn vs info
| @Rule | ||
| public TaskActionTestKit taskActionTestKit = new TaskActionTestKit(); | ||
| private static final List<DataSegment> WIKI_SEGMENTS_1X10D = |
The 1X10D in the variable got me thinking this was a roman numeral of sorts :) I see we use this convention in a few other places too
Yeah, it is meant to indicate 1 partition x 10 days.
Any other nomenclature seemed too verbose, so I went ahead with this.
| * {@link OverlordDuty} to delete unused segments from metadata store and the | ||
| * deep storage. Launches {@link EmbeddedKillTask}s to clean unused segments | ||
| * of a single datasource-interval. | ||
| * |
I think it would be good to cross-link the coordinator-based KillUnusedSegments duty in this class and vice-versa.
capistrant
left a comment
Looking good to me. I left just a couple more small comments of my own. @abhishekrb19 left a review with good comments that I won't individually +1, but in general I support them.
| } | ||
| @Test | ||
| public void test_run_killsSegmentUpdatedInFuture_ifBufferPeriodIsNegative() |
Is there a legitimate use case for a negative buffer period? Glad you added a test for it, but it feels weird.
Yeah, I don't recall why I had added it myself 😅, segment update times cannot be in the future and it doesn't make sense to have a negative buffer period anyway. Removing it.
| ) | ||
| { | ||
| this.enabled = Configs.valueOrDefault(enabled, false); | ||
| this.bufferPeriod = Configs.valueOrDefault(bufferPeriod, Period.days(90)); |
Any reason for picking 90 days as default? I see that the coordinator config for the same is 30 days per https://druid.apache.org/docs/latest/configuration/ ... As long as it is documented, I don't have any preference between the two but was curious.
Seems like a typo, fixing it.
What are the worst-case side effects if a user has both Coordinator and Overlord kill running, or is submitting kills to the Coordinator while Overlord kill is enabled? It feels like it will be hard to ensure Druid users aren't doing this. We can heavily document the fact that it shouldn't be done. But short of having some centralized gate for who can do kills (metastore, ZK, or something), I don't see how we can prevent it as long as the ability to enable killing for both services exists at the same time.

@capistrant , @abhishekrb19 , thanks for the thorough review!
@kfaraz There is a follow-up item listed in the description to limit logging to deletion failures/skips only:
There should be some message that is logged at INFO level for each segment that is deleted. Deleting a segment is a destructive operation that in some situations is not undoable. Operators need some logs when that happens, for situations where they wonder what happened to their segment files. Ideally, there is exactly one message logged at INFO level for each segment (not zero, not more than one). Ideally that one message has both the segment ID and the storage location (S3/GCS/Azure/etc).

That's a fair point, @gianm . In that case, the current logging done by the respective `DataSegmentKiller`s can be kept as is. With embedded kill tasks, the only drawback would be that these segment delete messages would flood the Overlord logs. Another option could be to direct the task logs to a different log file, same as regular task logs. Let me know what your thoughts are.
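A tiny sketch of the log shape being discussed: exactly one INFO-level line per deleted segment, carrying both the segment ID and the deep-storage location (the method name and message format are illustrative, not the PR's actual code):

```java
public class SegmentKillLogDemo {
    // One INFO-level line per deleted segment: segment ID plus deep-storage path,
    // so operators can later trace what happened to a given segment file.
    public static String deletionLogLine(String segmentId, String storagePath) {
        return String.format("Deleted segment[%s] from deep storage at [%s].", segmentId, storagePath);
    }
}
```

Keeping the message to a single line per segment makes the deletion auditable without multiplying log volume.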
| final Stopwatch resetDuration = Stopwatch.createStarted(); | ||
| try { | ||
| killQueue.clear(); | ||
| if (!leaderSelector.isLeader()) { |
Did you mean to add an early return in this new conditional block to prevent the queue re-build?
Ah, thanks for catching it!
| ).build() | ||
| ); | ||
| final Set<Interval> sampleIntervals = intervals.stream().limit(5).collect(Collectors.toSet()); |
Will this change to using a sample of 5 have to potentially change back to the verbose interval set based on what we decide about Gian's note about all deletes being auditable in logs?
The segment IDs would already be logged by the tasks, but I guess we can fix this up too.
Especially since the number of intervals is typically always low.
Coordinator and Overlord always launch kill tasks for a single interval.
Tasks submitted manually are likely to have few intervals too.
@capistrant , I have addressed the latest comments. Is it okay to keep the docs changes for a follow-up PR?
| final Set<String> dataSources = storageCoordinator.retrieveAllDatasourceNames(); | ||
| final Map<String, Integer> dataSourceToIntervalCounts = new HashMap<>(); | ||
| for (String dataSource : dataSources) { | ||
| storageCoordinator.retrieveUnusedSegmentIntervals(dataSource, MAX_INTERVALS_TO_KILL_IN_DATASOURCE).forEach( | ||
| interval -> { | ||
| dataSourceToIntervalCounts.merge(dataSource, 1, Integer::sum); | ||
| killQueue.offer(new KillCandidate(dataSource, interval)); | ||
| } |
For fairness, the Coordinator duty uses a CircularList to round-robin through the kill datasources. But for the Overlord duty, since there's no constraint on task slots, it would potentially process all of them in a deterministic order. If there are many datasources with large numbers of unused segment intervals hitting the ~10K threshold, could the ones later in the list end up consistently deprioritized?
We can revisit this logic if needed, just trying to understand the queuing characteristics of this current logic.
The kill queue prioritizes older intervals. So, once the intervals of a datasource become old enough, they will become top priority.
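A minimal sketch of that ordering behavior, assuming the queue is built with a comparator that puts older interval starts first; the `Candidate` class here is a hypothetical stand-in for the PR's `KillCandidate`:

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class KillQueueOrderingDemo {
    static final class Candidate {
        final String dataSource;
        final long intervalStartMillis;

        Candidate(String dataSource, long intervalStartMillis) {
            this.dataSource = dataSource;
            this.intervalStartMillis = intervalStartMillis;
        }
    }

    public static String firstToBeKilled() {
        // Older intervals (smaller start millis) are polled first, regardless of
        // which datasource queued them or in what order they were offered.
        PriorityBlockingQueue<Candidate> killQueue = new PriorityBlockingQueue<>(
            1000,
            Comparator.comparingLong((Candidate c) -> c.intervalStartMillis)
        );
        killQueue.offer(new Candidate("wiki", 2_000L));
        killQueue.offer(new Candidate("koala", 1_000L));
        killQueue.offer(new Candidate("wiki", 3_000L));
        return killQueue.poll().dataSource;
    }
}
```

So a datasource later in the enumeration order is not starved: once its intervals are the oldest in the queue, they are polled first.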
Ah, I see, thanks for the clarification.
I'm not really worried about there being too many logs. I think of this log message as part of the segment lifecycle: first a segment is allocated (if in append mode), then published (always), then marked unused (if no longer needed), then deleted (if killed). So these kill logs shouldn't be "flooding", in terms of volume, any more than logs related to allocation and publish are flooding. I think we already have at least one log message for each of those actions.

Thanks for the clarification, @gianm !

Thanks for the reviews, @abhishekrb19 , @capistrant ! I have merged the PR since the IT failures are unrelated and are already being fixed in another PR #18067 .

@kfaraz with a release window coming up for Druid, let's make sure we get this documented soon so folks know how to use it in Druid 34, should they choose.

Absolutely, @capistrant , thanks for the reminder!

@capistrant , I have created #18124 for docs changes.
Docs changes for #18028 - Document metrics and configs for embedded kill tasks - Remove duplicate configs for Coordinator auto-kill from `data-management/delete.md` - Fix up references
Description
Kill tasks currently suffer from several drawbacks:
- Overhead of spawning the peon and inter-process communication.
- Several configs to tune: `maxKillTaskSlotRatio`, `maxKillTaskSlots`, `kill.bufferPeriod`, `kill.durationToRetain`, `kill.period`.

This patch adds an embedded mode of running kill tasks on the Overlord itself.
Solution: Embedded kill tasks
These embedded tasks
Items for follow-up PR
Design
Most of the heavy lifting of kill tasks is already done by the Overlord via task actions.
Moving the kill of segments to the Overlord helps avoid unnecessary launching of tasks,
thus keeping more task slots available, and also reduces inter-process communication.
The Overlord can also leverage the segment metadata cache for several of its operations,
thus improving performance further.
The only responsibility added to the Overlord would be deleting segment files from the deep store.
Implementation
This helps ensure that locks are not held over any interval for too long.
- `druid.manager.segments.killUnused.enabled`
- `druid.manager.segments.killUnused.bufferPeriod`

Changes
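For reference, enabling the feature would look something like the following `runtime.properties` fragment. The buffer-period value is just an example, and the exact accepted values for `useIncrementalCache` should be checked against the Druid docs (the validation above only requires a mode other than `NEVER`):

```properties
# Embedded kill requires the segment metadata incremental cache to be in use
druid.manager.segments.useIncrementalCache=always
# Turn on embedded kill tasks on the Overlord
druid.manager.segments.killUnused.enabled=true
# Retain unused segments for this period before killing them (example value)
druid.manager.segments.killUnused.bufferPeriod=P30D
```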
Main classes to review
- `UnusedSegmentKiller`: `OverlordDuty` that launches embedded kill tasks
- `UnusedSegmentKillerConfig` with the following fields:
  - `enabled`: Turns on the segment killer on the Overlord
  - `bufferPeriod`: Period for which segments are retained even after being marked as unused
- `EmbeddedKillTask`: extends `KillUnusedSegmentsTask` to modify some behaviour
- `KillTaskToolbox`: simplified version of `TaskToolbox` to run embedded kill tasks

Other changes
- Extend `KillUnusedSegmentsTask` to modify some behaviour for embedded tasks
- Add new methods to `IndexerMetadataStorageCoordinator` and `SqlSegmentsMetadataQuery`
- Short-circuit `TaskLockbox.remove` and make it idempotent

Follow-up changes not included in this PR
Currently, a `kill` task logs an info message for every file removed from the deep storage. For example, on S3:
druid/extensions-core/s3-extensions/src/main/java/org/apache/druid/storage/s3/S3DataSegmentKiller.java
Lines 151 to 155 in 2b1f1fc
While this is okay for a normal `kill` task, it becomes too verbose when running embedded kill tasks on the Overlord. For embedded kill tasks, we should log only warnings or errors for paths that could not be deleted or were skipped for some reason.
This would require the following changes:
- Update the `DataSegmentKiller` interface to return details of deleted or skipped paths
- Log the result of `DataSegmentKiller.kill` in `KillUnusedSegmentsTask`

Release note
Add an embedded mode for running kill tasks for unused segments on the Overlord itself.
Advantages of embedded kill tasks
- No task slots are used, unlike regular `kill` tasks

New metrics
| Metric | Dimensions | Emitted when |
|---|---|---|
| `segment/killed/metadataStore/count` | `dataSource`, `taskId`, `taskType`, `groupId`, `tags` | |
| `segment/killed/deepStorage/count` | `dataSource`, `taskId`, `taskType`, `groupId`, `tags` | |
| `segment/kill/queueReset/time` | | `druid.manager.segments.killUnused.enabled` is true. |
| `segment/kill/queueProcess/time` | | `druid.manager.segments.killUnused.enabled` is true. |
| `segment/kill/jobsProcessed/count` | | `druid.manager.segments.killUnused.enabled` is true. |
| `segment/kill/skippedIntervals/count` | `dataSource`, `taskId` | `druid.manager.segments.killUnused.enabled` is true. |

This PR has: