
Associate pending segments with the tasks that requested them#16144

Merged
kfaraz merged 28 commits intoapache:masterfrom
AmatyaAvadhanula:pending_segments_with_tasks
Apr 17, 2024
Merged

Associate pending segments with the tasks that requested them#16144
kfaraz merged 28 commits intoapache:masterfrom
AmatyaAvadhanula:pending_segments_with_tasks

Conversation


@AmatyaAvadhanula AmatyaAvadhanula commented Mar 18, 2024

This PR aims to associate pending segments with the task groups that created them.

Motivation:

  1. The association facilitates cleanup of unneeded pending segments as soon as all tasks in a group exit.

Pending segment cleanup helps delete entries immediately after tasks exit and can reduce the load on the metadata store during segment allocation. In some cases, it can also prevent segment allocation failures caused by conflicting pending segments that are no longer needed.


  2. It also enables a change in how the transactional append and replace actions commit segments during concurrent append and replace.
    A) When a concurrent replace occurs, it can upgrade any pending segments of concurrent append tasks to the replacing task's lock version.
    B) When an appending task commits segments, it can commit not only the pending segments that it directly created but also their upgraded versions.

Previously, the upgraded pending segments created by replace tasks could have different ids than the segments eventually committed by the appending task. This does not affect batch appends.
For concurrent streaming ingestion, however, there could be a race where an upgraded pending segment on the indexer corresponds to the same root segment as the parent of a different upgraded segment committed by the job.
In that case, there could be a brief period when the upgraded realtime segment is being served while the committed segment with a different id is also being served by a historical. Since their ids are distinct, the broker would merge results from both, leading to data duplication in queries.

The change in protocol ensures that an append action upgrades a segment set which corresponds exactly to the pending segment upgrades made by the concurrent replace action, eliminating any duplication in query results that may occur due to the above race.
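To illustrate why distinct ids cause the duplication above, here is a minimal sketch (hypothetical names, not Druid's actual broker code) of id-based de-duplication: the broker queries each distinct segment id once, so two copies of the same id collapse, while an upgraded realtime segment and a committed segment with different ids are both queried.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: served segments are keyed by id, so duplicate ids
// collapse to one entry but distinct ids for the same data do not.
class BrokerDedupSketch {
    static Set<String> segmentsToQuery(List<String> announcedIds) {
        // De-duplication is purely id-based: same id -> queried once.
        return new LinkedHashSet<>(announcedIds);
    }

    public static void main(String[] args) {
        // Same id announced by a realtime task and a historical: queried once.
        Set<String> ok = segmentsToQuery(List.of("ds_2024_v2_0", "ds_2024_v2_0"));
        // The race described above: the upgraded realtime id differs from the
        // committed id, so both are queried and rows are double-counted.
        Set<String> dup = segmentsToQuery(List.of("ds_2024_v2_0", "ds_2024_v2_1"));
        System.out.println(ok.size() + " " + dup.size()); // 1 2
    }
}
```

With matching upgraded ids on both sides, the second case collapses to the first.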


Implementation details:

Changes to `druid_pendingSegments` metadata table:

group_id -> task replica group id for streaming ingestion,
            index_parallel task id for native batch ingestion,
            controller task id for MSQ inserts

parent   -> the pending segment from which the current entry was upgraded
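The two new columns can be sketched as a plain Java class (field names follow this PR description; note that review comments below suggest renaming them to task_allocator_id and upgraded_from_segment_id, so the committed code may differ):

```java
// Sketch of the two new druid_pendingSegments columns, per the PR description.
class PendingSegmentSketch {
    final String segmentId;
    final String groupId; // replica group / index_parallel id / MSQ controller id
    final String parent;  // pending segment this entry was upgraded from

    PendingSegmentSketch(String segmentId, String groupId, String parent) {
        this.segmentId = segmentId;
        this.groupId = groupId;
        this.parent = parent;
    }

    // Per the description, a freshly allocated (root) pending segment
    // records parent = self rather than null.
    boolean isRoot() {
        return segmentId.equals(parent);
    }
}
```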



* Pending segment cleanup (invoked on the Overlord directly instead of via a new task action)
- When a ParallelIndexSupervisorTask or MSQ controller task exits, delete all pending segments associated with its id. This is useful in the case of task failures.
- When a streaming ingestion task exits and there are no other active tasks corresponding to its base sequence name, delete all pending segments associated with it.
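The streaming cleanup rule above can be sketched as follows (hypothetical helper, not the PR's actual Overlord/TaskLockbox code): pending segments of a group are deleted only once no active task shares that group.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: track active tasks per group; signal cleanup only when the last
// task of a group (e.g. the last replica of a streaming task group) exits.
class PendingSegmentCleanupSketch {
    private final Map<String, Set<String>> activeTasksByGroup = new HashMap<>();

    void taskStarted(String groupId, String taskId) {
        activeTasksByGroup.computeIfAbsent(groupId, g -> new HashSet<>()).add(taskId);
    }

    /** Returns true if pending segments of this group should be deleted now. */
    boolean taskExited(String groupId, String taskId) {
        Set<String> tasks = activeTasksByGroup.get(groupId);
        if (tasks != null) {
            tasks.remove(taskId);
            if (!tasks.isEmpty()) {
                return false; // a replica is still running -> keep pending segments
            }
            activeTasksByGroup.remove(groupId);
        }
        return true; // no active task left in the group -> clean up now
    }
}
```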


Changes to task actions:

* SegmentAllocateAction
- Associate the pending segment with its group_id when the pending segment is written to the metadata store; parent = self


* SegmentTransactionalReplace
- For a replace lock held over an interval:
    transaction {
      commit input segments contained within interval
      upgrade ids in the upgradeSegments table corresponding to this task to the replace lock's version and commit them
      fetch payload, group_id for pending segments 
      upgrade each such pending segment to the replace lock's version with the corresponding parent
    }
    For every pending segment with version == replace lock version:
        Fetch the payload and group_id of the pending segment and relay them to the supervisor
        The supervisor relays the payloads to all tasks with the corresponding group_id so they can serve realtime queries
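The replace-side upgrade step can be sketched as follows (hypothetical types; the PR's actual logic lives in the metadata storage coordinator): each pending segment not yet at the replace version gets an upgraded copy under the replace lock's version, preserving group_id and recording the original as parent.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: upgrade pending segments to a replace lock's version, keeping
// group_id so the upgrades can be relayed back to the appending tasks.
class ReplaceUpgradeSketch {
    record Pending(String id, String version, String groupId, String parent) {}

    static List<Pending> upgradeAll(List<Pending> pendings, String replaceVersion) {
        List<Pending> upgraded = new ArrayList<>();
        for (Pending p : pendings) {
            if (!p.version().equals(replaceVersion)) {
                upgraded.add(new Pending(
                    p.id() + "_" + replaceVersion, // hypothetical id scheme
                    replaceVersion,                // replace lock's version
                    p.groupId(),                   // same group -> relayed to its tasks
                    p.id()                         // parent = the segment it upgrades
                ));
            }
        }
        return upgraded;
    }
}
```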


* SegmentTransactionalAppend
- For an append lock held over an interval:
    transaction {
      commit input segments contained within interval
      if there is an active replace lock over the interval:
        add an entry for the inputSegment corresponding to the replace lock's task in the upgradeSegments table
      fetch pending segments with parent contained within the input segments, and commit them
    }
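The last step of the append transaction can be sketched as follows (hypothetical types, not the PR's actual SegmentTransactionalAppendAction): alongside its input segments, the append also commits every upgraded pending segment whose parent is one of those inputs, so the committed set matches the concurrent replace's upgrades exactly.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: commit input segments plus the upgraded pending segments whose
// parent is among the inputs (parent == id marks a root, non-upgraded entry).
class AppendCommitSketch {
    record Pending(String id, String parent) {}

    static Set<String> segmentsToCommit(Set<String> inputSegments, List<Pending> allPending) {
        Set<String> toCommit = new LinkedHashSet<>(inputSegments);
        for (Pending p : allPending) {
            if (inputSegments.contains(p.parent()) && !p.id().equals(p.parent())) {
                toCommit.add(p.id()); // upgraded version of an input segment
            }
        }
        return toCommit;
    }
}
```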

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Mar 18, 2024
Comment thread server/src/main/java/org/apache/druid/metadata/PendingSegment.java Fixed
)
)
);
alterPendingSegmentsTableAddParentIdAndTaskGroup(tableName);

It'd be better to move this function out to createPendingSegmentsTable() after the call to this method.

log.info("Table[%s] already has column[task_group].", tableName);
} else {
log.info("Adding column[task_group] to table[%s].", tableName);
statements.add(StringUtils.format("ALTER TABLE %1$s ADD COLUMN task_group VARCHAR(255)", tableName));

Similar to validateSegmentsTable(), do we also need validation that the pending segments table is upgraded to the desired schema?

Comment thread server/src/main/java/org/apache/druid/metadata/PendingSegment.java Outdated

@kfaraz kfaraz left a comment


Thanks a lot for the changes, @AmatyaAvadhanula !

I have tried to spend some more time with the approach and I agree with you that this seems to be the best path forward.

  • It simplifies the upgrade logic to some extent.
  • It offers the opportunity to actively clean up pending segments once they are not needed anymore.
  • It probably also safeguards against OL crashes. In the current impl (before this PR), the OL going down would probably cause pending segment to upgraded pending segment mappings to get lost as they are not clearly persisted anywhere. The prev_segment_id might have that info but it is vague since that column has a very broad definition at this point.

Suggestions
These are my main suggestions:

  • Rename parent_id to upgraded_from_segment_id. That clarifies the exact meaning and purpose of this column.
  • This column should be null for an original (non-upgraded) pending segments.
  • Rename task_group to something more distinct. We can either stick to base_sequence_name because afaict, this value is always going to be the same as the baseSequenceName. Otherwise, how about we call it task_allocator_id?
  • Leave out the cleanup logic in the TaskLockbox, we can do that later.

@@ -135,14 +135,16 @@ public SegmentPublishResult perform(Task task, TaskActionToolbox toolbox)
if (startMetadata == null) {
publishAction = () -> toolbox.getIndexerMetadataStorageCoordinator().commitAppendSegments(
segments,

At some point, the plan was to have one action for just committing segments and another action for committing segments and metadata both. So we decided to keep a method that would just commit segments.

But we eventually decided against having the two actions as it didn't really serve a lot of purpose. So now we could simplify the IndexerMetadataStorageCoordinator interface too.

task.getId()
);
}
final String pendingSegmentGroup = task.getPendingSegmentGroup();

I would advise keeping the cleanup logic in a separate PR. We will be able to focus and test on it better.

Comment thread server/src/main/java/org/apache/druid/metadata/PendingSegment.java Outdated
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java Outdated
);
}

protected void setSupervisorManager(SupervisorManager supervisorManager)

Rather than this, have a separate createTaskActionToolbox method that accepts a SupervisorManager.

Comment thread indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java Outdated
@kfaraz kfaraz marked this pull request as ready for review April 5, 2024 04:21
);
}

public Map<SegmentId, SegmentId> getAnnouncedSegmentsToParentSegments()

Check notice — Code scanning / CodeQL: Exposing internal representation

getAnnouncedSegmentsToParentSegments exposes the internal representation stored in field announcedSegmentsToParentSegments. The value may be modified after this call to getAnnouncedSegmentsToParentSegments.
@SuppressWarnings("UnstableApiUsage")
public String computeSequenceNamePrevIdSha1(boolean skipSegmentLineageCheck)
{
final Hasher hasher = Hashing.sha1().newHasher()

Check notice — Code scanning / CodeQL: Deprecated method or constructor invocation

Invoking Hashing.sha1 should be avoided because it has been deprecated.
@kfaraz kfaraz left a comment

A few more comments. Yet to review the cleanup logic in TaskLockbox, will try to get it done soon.

pendingSegment.getId().asSegmentId().toString(),
pendingSegment.getId()
));
Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> upgradedPendingSegments = new HashMap<>();

+1, rename to segmentToParent (we can omit the prefix pending as this method only deals with pending segments, and it can be taken for granted.)

// this set should be accessed under the giant lock.
private final Set<String> activeTasks = new HashSet<>();

// Stores map of pending task group of tasks to the set of their ids.

+1

Comment on lines +50 to +57
* Pseudo code (for a single interval):
* For an append lock held over an interval:
* transaction {
* commit input segments contained within interval
* if there is an active replace lock over the interval:
* add an entry for the inputSegment corresponding to the replace lock's task in the upgradeSegments table
* fetch pending segments with parent contained within the input segments, and commit them
* }

It doesn't seem appropriate to have the implementation described as pseudo-code. Someone might as well read the code. It is better to briefly describe the key points of the implementation in a list fashion. (This is not a blocker for the PR).

* your task for the segment intervals.
*
* <pre>
* Pseudo code (for a single interval)

Same comment regarding pseudo-code.

Comment on lines +23 to +24
* An interface to be implemented by every appending task that allocates pending segments.
*/

Suggested change
* An interface to be implemented by every appending task that allocates pending segments.
*/
* An append task that can allocate pending segments. All concrete {@link Task} implementations that need to allocate pending segments must implement this interface.
*/

Comment thread server/src/main/java/org/apache/druid/metadata/PendingSegmentRecord.java Outdated
@kfaraz kfaraz left a comment

Reviewed cleanup logic, flow looks okay. Left some comments. Will take another final pass after all the existing comments are addressed.

// this set should be accessed under the giant lock.
private final Set<String> activeTasks = new HashSet<>();

// Stores map of pending task group of tasks to the set of their ids.

Please address this and also rename this map to activeAllocatorIdToTaskIds. You need not declare it as a HashMap, just a Map would suffice.

Comment thread server/src/main/java/org/apache/druid/metadata/PendingSegmentRecord.java Outdated
Comment on lines +38 to +42
* <li> id -> id (Unique identifier for pending segment) <li/>
* <li> sequence_name -> sequenceName (sequence name used for segment allocation) <li/>
* <li> sequence_prev_id -> sequencePrevId (previous segment id used for segment allocation) <li/>
* <li> upgraded_from_segment_id -> upgradedFromSegmentId (Id of the root segment from which this was upgraded) <li/>
* <li> task_allocator_id -> taskAllocatorId (Associates a task / task group / replica group with the pending segment) <li/>
@kfaraz kfaraz Apr 16, 2024

This list is not needed here, the description of the fields should be in the javadocs of the respective getters. (not a blocker for this PR).

@kfaraz kfaraz left a comment

There are leftover comments that can be addressed in a follow up.


@JsonTypeName(MSQWorkerTask.TYPE)
public class MSQWorkerTask extends AbstractTask
public class MSQWorkerTask extends AbstractTask implements PendingSegmentAllocatingTask

This class need not implement PendingSegmentAllocatingTask as it never actually does any allocation. The allocation is always done by the controller task.

this can be addressed in a follow up PR.

Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> segmentToParent = new HashMap<>();
pendingSegments.forEach(pendingSegment -> {
if (pendingSegment.getUpgradedFromSegmentId() != null
&& !pendingSegment.getUpgradedFromSegmentId().equals(pendingSegment.getId().asSegmentId().toString())) {

Can the upgradedFromSegmentId ever be equal to the id itself?
A normal/root pending segment (i.e. one created by allocation and not upgrade) would have upgraded_from_segment_id as null, right?

"Upgraded [%d] pending segments for REPLACE task[%s]: [%s]",
upgradedPendingSegments.size(), task.getId(), upgradedPendingSegments
);
final Set<ReplaceTaskLock> replaceLocksForTask = toolbox

Some comments here would be helpful.

));
Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> segmentToParent = new HashMap<>();
pendingSegments.forEach(pendingSegment -> {
if (pendingSegment.getUpgradedFromSegmentId() != null

Should we only look at pending segments that were upgraded by this task rather than all upgraded pending segments?

/**
*/
public class NoopTask extends AbstractTask
public class NoopTask extends AbstractTask implements PendingSegmentAllocatingTask

Does NoopTask need to implement the new interface for the purpose of tests?

* @param taskAllocatorId task id / task group / replica group for an appending task
* @return number of pending segments deleted from the metadata store
*/
int deletePendingSegmentsForTaskGroup(String taskAllocatorId);

Method needs to be renamed.

}
catch (Exception e) {
log.error(e, "PendingSegment[%s] mapping update request to version[%s] on Supervisor[%s] failed",
log.error(e, "PendingSegmentRecord[%s] mapping update request to version[%s] on Supervisor[%s] failed",

This change is not needed.

// always insert empty previous sequence id
insertPendingSegmentIntoMetastore(handle, newIdentifier, dataSource, interval, "", sequenceName, sequenceNamePrevIdSha1);
insertPendingSegmentIntoMetastore(handle, newIdentifier, dataSource, interval, "", sequenceName, sequenceNamePrevIdSha1,
taskAllocatorId

Formatting is off.
