Check for handoff of upgraded segments#16162

Merged
kfaraz merged 28 commits intoapache:masterfrom
AmatyaAvadhanula:wait_for_handoff_upgraded_segments
Apr 25, 2024
Conversation

@AmatyaAvadhanula
Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Mar 19, 2024

CHANGES

  1. Check for handoff of upgraded realtime segments.

Currently, streaming ingestion tasks check for the handoff of segments in SegmentsAndCommitMetadata.getSegments(), which contains only the set of segments allocated by the task.
This can lead to data temporarily missing from queries while the upgraded segments are being handed off, as those segments are dropped from peons once the older versions of the segments are committed.
This patch adds the set of upgraded segments to the class, so that handoff of the upgraded segments can be checked as well.


  2. Drop sink only when all associated realtime segments have been abandoned.

If a sink S is associated with (realtime) segment versions S0, S1 and S2, we must wait until all of them have been unannounced before dropping the sink.

abandon (sink, segment):
    mark segment as abandoned.
    if every upgraded id of the base segment is abandoned:
        unannounce all upgraded ids of its base segment
        drop sink
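The abandon logic above can be sketched as follows. This is a hypothetical illustration, not the actual patch: the real change lives in StreamAppenderator and uses Druid's segment types, while the class and method names here are made up for clarity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "drop sink only when all versions are abandoned" rule.
class SinkAbandonTracker
{
  // base segment id -> all announced versions of that segment (including itself)
  private final Map<String, Set<String>> baseToVersions = new HashMap<>();
  private final Set<String> abandoned = new HashSet<>();

  void announce(String baseId, String versionId)
  {
    baseToVersions.computeIfAbsent(baseId, k -> new HashSet<>()).add(versionId);
  }

  /**
   * Marks one version as abandoned; returns true only when every announced
   * version of the base segment has been abandoned, i.e. when it is safe to
   * unannounce all versions and drop the sink.
   */
  boolean abandon(String baseId, String versionId)
  {
    abandoned.add(versionId);
    return abandoned.containsAll(baseToVersions.getOrDefault(baseId, Set.of()));
  }
}
```

For a sink S with versions S0, S1, S2, `abandon` returns false for the first two calls and true only once the last version is abandoned.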

  3. Delete pending segments upon commit

A realtime task such as index_kafka or index_kinesis may append segments multiple times during its lifetime.
Deleting pending segments in the same transaction in which they are committed prevents unneeded upgrades and partition space exhaustion when a concurrent replace happens.
More importantly, this helps change (2) by preventing a race where an upgraded pending segment is announced for a committed segment, causing temporary data duplication.


  4. Register pending segment upgrade only on those tasks to which the segment is associated.

If there are N pending realtime segments and T tasks, we make O(N * T) calls today by trying to upgrade each of the N pending segments on each of the T tasks.
By checking if task.baseSequenceName == pendingSegment.taskAllocatorId, we can reduce this to O(N).
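The routing idea can be sketched like this. The `Task` and `PendingSegment` records below are illustrative stand-ins, not the actual Druid classes; the point is that grouping the N pending segments by taskAllocatorId once avoids offering every segment to every task.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative stand-ins for the real task and pending-segment types.
record Task(String baseSequenceName) {}
record PendingSegment(String id, String taskAllocatorId) {}

class UpgradeRouter
{
  // Group the N pending segments by taskAllocatorId once (O(N)); each task
  // then looks up only its own segments, instead of the O(N * T) pattern of
  // trying every pending segment against every task.
  static Map<String, List<PendingSegment>> groupByAllocator(List<PendingSegment> pending)
  {
    return pending.stream().collect(Collectors.groupingBy(PendingSegment::taskAllocatorId));
  }

  static List<PendingSegment> segmentsForTask(Task task, Map<String, List<PendingSegment>> grouped)
  {
    return grouped.getOrDefault(task.baseSequenceName(), List.of());
  }
}
```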


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

committerSupplier.setMetadata(i + 1);
Assert.assertTrue(driver.add(ROWS.get(i), "dummy", committerSupplier, false, true).isOk());
}
committerSupplier.setMetadata(1);
Contributor Author

This change is intentional to avoid any uncertainty that may arise due to the order of segment ids.
By adding a single row, exactly one segment is created and we can verify its id in the exception message.

Contributor

Can't we control the order in the test?

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review March 20, 2024 06:30
@abhishekagarwal87
Contributor

what does original segment mean here? can you please elaborate?

Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Minor comments. LGTM otherwise.

final Object callerMetadata = metadata == null
? null
: ((AppenderatorDriverMetadata) metadata).getCallerMetadata();
final Set<DataSegment> upgradedSegments = new HashSet<>();
Contributor

this should be moved inside the retry block, right?

Contributor Author

I think that we return the result after the retry block as well and would need the upgraded segments there.
Am I missing something?

Contributor Author

@kfaraz Thank you for fixing this.


private final Object commitMetadata;
private final ImmutableList<DataSegment> segments;
private final ImmutableSet<DataSegment> upgradedSegments;
Contributor

please add a comment that these are extra versions created in case a replace happened in between.

Contributor Author

Done

Contributor

Better to have this as a javadoc on getUpgradedSegments().

Contributor

@kfaraz kfaraz left a comment

Thanks for the fix, @AmatyaAvadhanula ! Left some comments.


private final Object commitMetadata;
private final ImmutableList<DataSegment> segments;
private final ImmutableSet<DataSegment> upgradedSegments;
Contributor

Why does this class need to distinguish between upgraded segments and root ones? From the POV of this class, all of these segments were committed and thus should all be handed off.

Contributor Author

From the POV of segment handoff, the original and upgraded segments need not be distinguished from each other.

However, there are sanity checks where we must ensure that the committed segments are the same as the ones that were requested. Maintaining the upgraded segments separately helps with this check.
When the check fails and segments need to be killed from deep storage, we would also have to ensure that we do not run into errors trying to kill the same deep storage location multiple times. While this could be done by creating a set of load specs before every kill, I think it may be neater to maintain the original and upgraded sets separately for such purposes.

Contributor

@kfaraz kfaraz Mar 26, 2024

there are sanity checks where we must ensure that the committed segments are the same as the ones that were requested.

I should think the checks just need to verify that at least everything that was requested has been committed. There could be other stuff that got committed too.

Also, could you please link the places where these sanity checks are being performed?

Contributor

Looking at the code, the cleanup happens in BaseAppenderatorDriver.publishInBackground itself. For cleanup, we would use the original requested segmentsAndCommitMetadata and not the one we freshly created and returned.
That's what we seem to be doing in the current impl in this PR as well.

throw new RuntimeException(e);
}
return segmentsAndCommitMetadata;
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
Contributor

We shouldn't have to distinguish between upgraded and other segments here. Instead of creating a new SegmentsAndCommitMetadata with upgraded segments, we should create a new SegmentsAndCommitMetadata which has a single set containing all the segments (upgraded and otherwise) which were committed and thus must be handed off.

Suggested change
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
return new SegmentsAndCommitMetadata(publishResult.getSegments(), metadata);

So, essentially, the original segmentsAndCommitMetadata object passed in to this method represents segments we wanted to commit and the new object would represent segments that were actually committed.

Contributor

This can be addressed later.

@AmatyaAvadhanula
Contributor Author

AmatyaAvadhanula commented Mar 26, 2024

Marking as draft as there is a potential problem that still needs to be addressed.

A sink X corresponds to a segment S.
The segment S was upgraded to another id T querying the same sink X.

One must ensure that the sink X is dropped only after both S and T have been handed off. At present, it may be dropped after either one of them has been handed off.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft March 26, 2024 04:08
}

public Map<SegmentId, SegmentId> getAnnouncedSegmentsToParentSegments()
public Map<SegmentId, String> getAnnouncedSegmentsToParentSegments()

Check notice

Code scanning / CodeQL

Exposing internal representation

getAnnouncedSegmentsToParentSegments exposes the internal representation stored in field announcedSegmentsToParentSegments. The value may be modified after this call to getAnnouncedSegmentsToParentSegments (flagged at two call sites).
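The usual fix for this CodeQL finding is to return a defensive copy or an unmodifiable view of the field instead of the field itself. A minimal sketch (the surrounding class and the `announce` helper are hypothetical, not the actual Druid code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder demonstrating a defensive copy for the getter.
class SegmentAnnouncements
{
  private final Map<String, String> announcedSegmentsToParentSegments = new HashMap<>();

  void announce(String segmentId, String parentId)
  {
    announcedSegmentsToParentSegments.put(segmentId, parentId);
  }

  // Returning an immutable snapshot prevents callers from mutating internal state.
  public Map<String, String> getAnnouncedSegmentsToParentSegments()
  {
    return Map.copyOf(announcedSegmentsToParentSegments);
  }
}
```

Attempts to mutate the returned map then throw UnsupportedOperationException instead of silently corrupting the internal map.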
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review April 22, 2024 09:22
@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Apr 22, 2024
@AmatyaAvadhanula
Contributor Author

@kfaraz I've incorporated your pending feedback from #16144 as well.

Contributor

@kfaraz kfaraz left a comment

Left some comments, a couple of files have not been reviewed yet.

Comment thread server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java Outdated
Comment thread server/src/main/java/org/apache/druid/indexing/overlord/SegmentPublishResult.java Outdated
final RequestBuilder requestBuilder
= new RequestBuilder(HttpMethod.POST, "/pendingSegmentVersion")
.jsonContent(jsonMapper, new PendingSegmentVersions(basePendingSegment, newVersionOfSegment));
.jsonContent(jsonMapper, pendingSegmentRecord);
Contributor

Why do we need to change this API? The task side doesn't seem to need any info other than the base segment id and the upgraded segment id.

Postponing this refactor until later might simplify this PR a bit.

Contributor Author

We no longer have the original pending segment's SegmentIdWithShardSpec to continue using this API.

throw new RuntimeException(e);
}
return segmentsAndCommitMetadata;
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
Contributor

This can be addressed later.

@AmatyaAvadhanula AmatyaAvadhanula requested a review from kfaraz April 24, 2024 17:02
expectedException.expect(ExecutionException.class);
expectedException.expectCause(CoreMatchers.instanceOf(ISE.class));
expectedException.expectMessage(
"Fail test while dropping segment[foo_2000-01-01T00:00:00.000Z_2000-01-01T01:00:00.000Z_abc123]"
Contributor

Why don't we have the segment ID in the message anymore?

Contributor Author

We do but I'm observing that there is some flakiness in the test.
My local setup throws an exception with the same segment id.
The GitHub Actions runs fail with a different segment id.


Comment on lines +1170 to +1179
baseSegmentToUpgradedSegments.get(basePendingSegment).add(newSegmentVersion);
upgradedSegmentToBaseSegment.put(newSegmentVersion, basePendingSegment);
Contributor

These two operations can be put inside a method. Also, I guess it is better to use computeIfAbsent for baseSegmentToUpgradedSegments.

Contributor Author

We ensure that the base segment has been added for every sink that is present,
i.e. the set in the value of the map needs to contain the key as well.
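The reviewer's suggestion (one helper method, with computeIfAbsent for baseSegmentToUpgradedSegments) would look roughly like this. The map names follow the diff above, but the surrounding class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper showing the suggested registration helper.
class PendingSegmentUpgradeRegistry
{
  private final Map<String, Set<String>> baseSegmentToUpgradedSegments = new HashMap<>();
  private final Map<String, String> upgradedSegmentToBaseSegment = new HashMap<>();

  // Single method keeps both maps in sync; computeIfAbsent avoids a separate
  // null check when the base segment has no entry yet.
  void registerUpgrade(String basePendingSegment, String newSegmentVersion)
  {
    baseSegmentToUpgradedSegments
        .computeIfAbsent(basePendingSegment, k -> new HashSet<>())
        .add(newSegmentVersion);
    upgradedSegmentToBaseSegment.put(newSegmentVersion, basePendingSegment);
  }

  Set<String> upgradesOf(String basePendingSegment)
  {
    return baseSegmentToUpgradedSegments.getOrDefault(basePendingSegment, Set.of());
  }
}
```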

@AmatyaAvadhanula AmatyaAvadhanula force-pushed the wait_for_handoff_upgraded_segments branch from 2dc3b10 to 248d614 Compare April 25, 2024 07:45
…aAvadhanula/druid into wait_for_handoff_upgraded_segments
* Unannounces the given base segment and all its upgraded versions.
*/
private void unannounceAllVersionsOfSegment(DataSegment baseSegment) throws IOException
private void unannounceAllVersionsOfSegment(DataSegment baseSegment)
Contributor

Maybe add a comment in the javadoc that this method should be synchronized on the corresponding sink. Alternatively, you could even take the sink as an argument and synchronize on it inside the method rather than relying on the callers to do so.

Contributor Author

Synchronized inside the method. Thanks!

if (baseId == null) {
return;
}
baseSegmentToUpgradedSegments.get(baseId).remove(id);
Contributor

Even though the set here cannot be null, it is best to handle the null case too.

Contributor Author

This no longer needs to be handled because of #16162 (comment)

Comment on lines +1156 to +1157
if (baseSegmentToUpgradedSegments.get(baseId).isEmpty()) {
baseSegmentToUpgradedSegments.remove(baseId);
Contributor

Why should we do this here? Why can't we do this at the end of unannounceAllVersionsOfSegment?

Contributor Author

Done. Thanks for suggesting the simplification.

public void registerNewVersionOfPendingSegment(PendingSegmentRecord pendingSegmentRecord) throws IOException
{
SegmentIdWithShardSpec basePendingSegment = idToPendingSegment.get(pendingSegmentRecord.getUpgradedFromSegmentId());
SegmentIdWithShardSpec newSegmentVersion = pendingSegmentRecord.getId();
Contributor

Rename for homogeneity with rest of the Druid code:

Suggested change
SegmentIdWithShardSpec newSegmentVersion = pendingSegmentRecord.getId();
SegmentIdWithShardSpec upgradedPendingSegment = pendingSegmentRecord.getId();

Contributor Author

Done. Renamed the method as well

// The base segment is associated with itself in the maps to maintain all the upgraded ids of a sink.
baseSegmentToUpgradedSegments.put(identifier, new HashSet<>());
baseSegmentToUpgradedSegments.get(identifier).add(identifier);
upgradedSegmentToBaseSegment.put(identifier, identifier);
Contributor

Why is this needed?

Contributor Author

It's not anymore.

Contributor

@kfaraz kfaraz left a comment

Looks good, have some minor queries, none of which are blockers to this PR.

);
}

private ListenableFuture<?> abandonSegment(
Contributor

Could you add a javadoc to this method? When exactly is a segment abandoned?

Contributor Author

Every segment is abandoned when the StreamAppenderator is closed or cleared.
A segment is also marked as abandoned by the StreamAppenderatorDriver once it has been handed off.

@kfaraz kfaraz merged commit 31eee7d into apache:master Apr 25, 2024
@kfaraz kfaraz deleted the wait_for_handoff_upgraded_segments branch April 25, 2024 16:33
@kfaraz
Contributor

kfaraz commented Apr 25, 2024

Thanks a lot for the changes, @AmatyaAvadhanula !

@AmatyaAvadhanula
Contributor Author

@kfaraz, @abhishekagarwal87 Thank you for the reviews

AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Apr 29, 2024
Changes:
1) Check for handoff of upgraded realtime segments.
2) Drop sink only when all associated realtime segments have been abandoned.
3) Delete pending segments upon commit to prevent unnecessary upgrades and
partition space exhaustion when a concurrent replace happens. This also prevents
potential data duplication.
4) Register pending segment upgrade only on those tasks to which the segment is associated.
kfaraz added a commit that referenced this pull request Apr 30, 2024
Changes:
1) Check for handoff of upgraded realtime segments.
2) Drop sink only when all associated realtime segments have been abandoned.
3) Delete pending segments upon commit to prevent unnecessary upgrades and
partition space exhaustion when a concurrent replace happens. This also prevents
potential data duplication.
4) Register pending segment upgrade only on those tasks to which the segment is associated.

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

Labels

Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

Projects

None yet

Development

4 participants