Check for handoff of upgraded segments#16162

Merged
kfaraz merged 28 commits intoapache:masterfrom
AmatyaAvadhanula:wait_for_handoff_upgraded_segments
Apr 25, 2024
Conversation

@AmatyaAvadhanula
Contributor

@AmatyaAvadhanula AmatyaAvadhanula commented Mar 19, 2024

CHANGES

  1. Check for handoff of upgraded realtime segments.

Currently, streaming ingestion tasks check for the handoff of segments in SegmentsAndCommitMetadata.getSegments(), which contains only the set of segments allocated by the task.
This can lead to data temporarily missing from queries while the upgraded segments are being handed off, as those segments are dropped from peons once the older versions of the segments are committed.
This patch adds the set of upgraded segments to the class, so that handoff of the upgraded segments can be checked as well.


  2. Drop sink only when all associated realtime segments have been abandoned.

If a sink S is associated with (realtime) segment versions S0, S1 and S2, we must wait until all of them have been unannounced before dropping the sink.

abandon (sink, segment):
    mark segment as abandoned.
    if every upgraded id of the base segment is abandoned:
        unannounce all upgraded ids of its base segment
        drop sink
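The abandon logic above can be sketched as follows. This is a hypothetical illustration, not the actual patch: the real change lives in StreamAppenderator and uses Druid's segment types, while the class and method names here are made up for clarity.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "drop sink only when all versions are abandoned" rule.
class SinkAbandonTracker
{
  // base segment id -> all announced versions of that segment (including itself)
  private final Map<String, Set<String>> baseToVersions = new HashMap<>();
  private final Set<String> abandoned = new HashSet<>();

  void announce(String baseId, String versionId)
  {
    baseToVersions.computeIfAbsent(baseId, k -> new HashSet<>()).add(versionId);
  }

  /**
   * Marks one version as abandoned; returns true only when every announced
   * version of the base segment has been abandoned, i.e. when it is safe to
   * unannounce all versions and drop the sink.
   */
  boolean abandon(String baseId, String versionId)
  {
    abandoned.add(versionId);
    return abandoned.containsAll(baseToVersions.getOrDefault(baseId, Set.of()));
  }
}
```

For a sink S with versions S0, S1, S2, `abandon` returns false for the first two calls and true only once the last version is abandoned.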

  3. Delete pending segments upon commit

A realtime task such as index_kafka or index_kinesis may append segments multiple times during its lifetime.
Deleting pending segments in the same transaction in which they are committed prevents unneeded upgrades and partition space exhaustion when a concurrent replace happens.
More importantly, this helps change (2) by preventing a race where an upgraded pending segment is announced for a committed segment, causing temporary data duplication.


  4. Register pending segment upgrade only on those tasks to which the segment is associated.

If there are N pending realtime segments and T tasks, we make O(N * T) calls today by trying to upgrade each of the N pending segments on each of the T tasks.
By checking if task.baseSequenceName == pendingSegment.taskAllocatorId, we can reduce this to O(N).
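The routing idea can be sketched like this. The `Task` and `PendingSegment` records below are illustrative stand-ins, not the actual Druid classes; the point is that grouping the N pending segments by taskAllocatorId once avoids offering every segment to every task.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative stand-ins for the real task and pending-segment types.
record Task(String baseSequenceName) {}
record PendingSegment(String id, String taskAllocatorId) {}

class UpgradeRouter
{
  // Group the N pending segments by taskAllocatorId once (O(N)); each task
  // then looks up only its own segments, instead of the O(N * T) pattern of
  // trying every pending segment against every task.
  static Map<String, List<PendingSegment>> groupByAllocator(List<PendingSegment> pending)
  {
    return pending.stream().collect(Collectors.groupingBy(PendingSegment::taskAllocatorId));
  }

  static List<PendingSegment> segmentsForTask(Task task, Map<String, List<PendingSegment>> grouped)
  {
    return grouped.getOrDefault(task.baseSequenceName(), List.of());
  }
}
```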


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

committerSupplier.setMetadata(i + 1);
Assert.assertTrue(driver.add(ROWS.get(i), "dummy", committerSupplier, false, true).isOk());
}
committerSupplier.setMetadata(1);
Contributor Author

This change is intentional to avoid any uncertainty that may arise due to the order of segment ids.
By adding a single row, exactly one segment is created and we can verify its id in the exception message.

Contributor

Can't we control the order in the test?

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review March 20, 2024 06:30
@abhishekagarwal87
Contributor

what does original segment mean here? can you please elaborate?

Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Minor comments. LGTM otherwise.

final Object callerMetadata = metadata == null
? null
: ((AppenderatorDriverMetadata) metadata).getCallerMetadata();
final Set<DataSegment> upgradedSegments = new HashSet<>();
Contributor

this should be moved inside the retry block, right?

Contributor Author

I think that we return the result after the retry block as well and would need the upgraded segments there.
Am I missing something?

Contributor Author

@kfaraz Thank you for fixing this.


private final Object commitMetadata;
private final ImmutableList<DataSegment> segments;
private final ImmutableSet<DataSegment> upgradedSegments;
Contributor

please add a comment that these are extra versions created in case a replace happened in between.

Contributor Author

Done

Contributor

Better to have this as a javadoc on getUpgradedSegments().

Contributor

@kfaraz kfaraz left a comment

Thanks for the fix, @AmatyaAvadhanula ! Left some comments.


private final Object commitMetadata;
private final ImmutableList<DataSegment> segments;
private final ImmutableSet<DataSegment> upgradedSegments;
Contributor

Why does this class need to distinguish between upgraded segments and root ones? From the POV of this class, all of these segments were committed and thus should all be handed off.

Contributor Author

From the POV of segment handoff, the original and upgraded segments need not be distinguished from each other.

However, there are sanity checks where we must ensure that the committed segments are the same as the ones that were requested. Maintaining the upgraded segments separately helps with this check.
When the check fails and segments need to be killed from deep storage, we would also have to ensure that we do not run into errors trying to kill the same deep storage location multiple times. While this could be done by creating a set of load specs before every kill, I think it may be neater to maintain the original and upgraded sets separately for such purposes.

Contributor

@kfaraz kfaraz Mar 26, 2024

there are sanity checks where we must ensure that the committed segments are the same as the ones that were requested.

I should think the checks just need to verify that at least everything that was requested has been committed. There could be other stuff that got committed too.

Also, could you please link the places where these sanity checks are being performed?

Contributor

Looking at the code, the cleanup happens in BaseAppenderatorDriver.publishInBackground itself. For cleanup, we would use the original requested segmentsAndCommitMetadata and not the one we freshly created and returned.
That's what we seem to be doing in the current impl in this PR as well.

throw new RuntimeException(e);
}
return segmentsAndCommitMetadata;
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
Contributor

We shouldn't have to distinguish between upgraded and other segments here. Instead of creating a new SegmentsAndCommitMetadata with upgraded segments, we should create a new SegmentsAndCommitMetadata which has a single set containing all the segments (upgraded and otherwise) which were committed and thus must be handed off.

Suggested change
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
return new SegmentsAndCommitMetadata(publishResult.getSegments(), metadata);

So, essentially, the original segmentsAndCommitMetadata object passed in to this method represents segments we wanted to commit and the new object would represent segments that were actually committed.

Contributor

This can be addressed later.

@AmatyaAvadhanula
Contributor Author

AmatyaAvadhanula commented Mar 26, 2024

Marking as draft as there is a potential problem that still needs to be addressed.

A sink X corresponds to a segment S.
The segment S was upgraded to another id T querying the same sink X.

One must ensure that the sink X is dropped only after both S and T have been handed off. At present, it may be dropped after either one of them has been handed off.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft March 26, 2024 04:08
}

public Map<SegmentId, SegmentId> getAnnouncedSegmentsToParentSegments()
public Map<SegmentId, String> getAnnouncedSegmentsToParentSegments()

Check notice

Code scanning / CodeQL

Exposing internal representation

getAnnouncedSegmentsToParentSegments exposes the internal representation stored in field announcedSegmentsToParentSegments. The value may be modified after this call to getAnnouncedSegmentsToParentSegments (flagged at two call sites).
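The usual fix for this CodeQL finding is to return a defensive copy or an unmodifiable view of the field instead of the field itself. A minimal sketch (the surrounding class and the `announce` helper are hypothetical, not the actual Druid code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder demonstrating a defensive copy for the getter.
class SegmentAnnouncements
{
  private final Map<String, String> announcedSegmentsToParentSegments = new HashMap<>();

  void announce(String segmentId, String parentId)
  {
    announcedSegmentsToParentSegments.put(segmentId, parentId);
  }

  // Returning an immutable snapshot prevents callers from mutating internal state.
  public Map<String, String> getAnnouncedSegmentsToParentSegments()
  {
    return Map.copyOf(announcedSegmentsToParentSegments);
  }
}
```

Attempts to mutate the returned map then throw UnsupportedOperationException instead of silently corrupting the internal map.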
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review April 22, 2024 09:22
@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Apr 22, 2024
@AmatyaAvadhanula
Contributor Author

@kfaraz I've incorporated your pending feedback from #16144 as well.

Contributor

@kfaraz kfaraz left a comment

Left some comments, a couple of files have not been reviewed yet.

Comment thread server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java Outdated
Comment thread server/src/main/java/org/apache/druid/indexing/overlord/SegmentPublishResult.java Outdated
final RequestBuilder requestBuilder
= new RequestBuilder(HttpMethod.POST, "/pendingSegmentVersion")
.jsonContent(jsonMapper, new PendingSegmentVersions(basePendingSegment, newVersionOfSegment));
.jsonContent(jsonMapper, pendingSegmentRecord);
Contributor

Why do we need to change this API? The task side doesn't seem to need any info other than the base segment id and the upgraded segment id.

Postponing this refactor until later might simplify this PR a bit.

Contributor Author

We no longer have the original pending segment's SegmentIdWithShardSpec to continue using this API.

throw new RuntimeException(e);
}
return segmentsAndCommitMetadata;
return segmentsAndCommitMetadata.withUpgradedSegments(upgradedSegments);
Contributor

This can be addressed later.

@AmatyaAvadhanula AmatyaAvadhanula requested a review from kfaraz April 24, 2024 17:02
expectedException.expect(ExecutionException.class);
expectedException.expectCause(CoreMatchers.instanceOf(ISE.class));
expectedException.expectMessage(
"Fail test while dropping segment[foo_2000-01-01T00:00:00.000Z_2000-01-01T01:00:00.000Z_abc123]"
Contributor

Why don't we have the segment ID in the message anymore?

Contributor Author

We do but I'm observing that there is some flakiness in the test.
My local setup throws an exception with the same segment id.
The GitHub Actions runs fail with a different segment id.


Comment on lines +1170 to +1179
baseSegmentToUpgradedSegments.get(basePendingSegment).add(newSegmentVersion);
upgradedSegmentToBaseSegment.put(newSegmentVersion, basePendingSegment);
Contributor

These two operations can be put inside a method. Also, I guess it is better to use computeIfAbsent for baseSegmentToUpgradedSegments.

Contributor Author

We ensure that the base segment has been added for every sink that is present,
i.e. the set in the value of the map needs to contain the key as well.
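The reviewer's suggestion (one helper method, with computeIfAbsent for baseSegmentToUpgradedSegments) would look roughly like this. The map names follow the diff above, but the surrounding class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper showing the suggested registration helper.
class PendingSegmentUpgradeRegistry
{
  private final Map<String, Set<String>> baseSegmentToUpgradedSegments = new HashMap<>();
  private final Map<String, String> upgradedSegmentToBaseSegment = new HashMap<>();

  // Single method keeps both maps in sync; computeIfAbsent avoids a separate
  // null check when the base segment has no entry yet.
  void registerUpgrade(String basePendingSegment, String newSegmentVersion)
  {
    baseSegmentToUpgradedSegments
        .computeIfAbsent(basePendingSegment, k -> new HashSet<>())
        .add(newSegmentVersion);
    upgradedSegmentToBaseSegment.put(newSegmentVersion, basePendingSegment);
  }

  Set<String> upgradesOf(String basePendingSegment)
  {
    return baseSegmentToUpgradedSegments.getOrDefault(basePendingSegment, Set.of());
  }
}
```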

@AmatyaAvadhanula AmatyaAvadhanula force-pushed the wait_for_handoff_upgraded_segments branch from 2dc3b10 to 248d614 Compare April 25, 2024 07:45
…aAvadhanula/druid into wait_for_handoff_upgraded_segments
* Unannounces the given base segment and all its upgraded versions.
*/
private void unannounceAllVersionsOfSegment(DataSegment baseSegment) throws IOException
private void unannounceAllVersionsOfSegment(DataSegment baseSegment)
Contributor

Maybe add a comment in the javadoc that this method should be synchronized on the corresponding sink. Alternatively, you could even take the sink as an argument and synchronize on it inside the method rather than relying on the callers to do so.

Contributor Author

Synchronized inside the method. Thanks!

if (baseId == null) {
return;
}
baseSegmentToUpgradedSegments.get(baseId).remove(id);
Contributor

Even though the set here cannot be null, it is best to handle the null case too.

Contributor Author

This no longer needs to be handled because of #16162 (comment)

Comment on lines +1156 to +1157
if (baseSegmentToUpgradedSegments.get(baseId).isEmpty()) {
baseSegmentToUpgradedSegments.remove(baseId);
Contributor

Why should we do this here? Why can't we do this at the end of unannounceAllVersionsOfSegment?

Contributor Author

Done. Thanks for suggesting the simplification.

public void registerNewVersionOfPendingSegment(PendingSegmentRecord pendingSegmentRecord) throws IOException
{
SegmentIdWithShardSpec basePendingSegment = idToPendingSegment.get(pendingSegmentRecord.getUpgradedFromSegmentId());
SegmentIdWithShardSpec newSegmentVersion = pendingSegmentRecord.getId();
Contributor

Rename for homogeneity with rest of the Druid code:

Suggested change
SegmentIdWithShardSpec newSegmentVersion = pendingSegmentRecord.getId();
SegmentIdWithShardSpec upgradedPendingSegment = pendingSegmentRecord.getId();

Contributor Author

Done. Renamed the method as well

// The base segment is associated with itself in the maps to maintain all the upgraded ids of a sink.
baseSegmentToUpgradedSegments.put(identifier, new HashSet<>());
baseSegmentToUpgradedSegments.get(identifier).add(identifier);
upgradedSegmentToBaseSegment.put(identifier, identifier);
Contributor

Why is this needed?

Contributor Author

It's not anymore.

Contributor

@kfaraz kfaraz left a comment

Looks good, have some minor queries, none of which are blockers to this PR.

);
}

private ListenableFuture<?> abandonSegment(
Contributor

Could you add a javadoc to this method? When exactly is a segment abandoned?

Contributor Author

Every segment is abandoned when the StreamAppenderator is closed or cleared.
A segment is also marked as abandoned by the StreamAppenderatorDriver once it has been handed off.

@kfaraz kfaraz merged commit 31eee7d into apache:master Apr 25, 2024
@kfaraz kfaraz deleted the wait_for_handoff_upgraded_segments branch April 25, 2024 16:33
@kfaraz
Contributor

kfaraz commented Apr 25, 2024

Thanks a lot for the changes, @AmatyaAvadhanula !

@AmatyaAvadhanula
Contributor Author

@kfaraz, @abhishekagarwal87 Thank you for the reviews

AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Apr 29, 2024
Changes:
1) Check for handoff of upgraded realtime segments.
2) Drop sink only when all associated realtime segments have been abandoned.
3) Delete pending segments upon commit to prevent unnecessary upgrades and
partition space exhaustion when a concurrent replace happens. This also prevents
potential data duplication.
4) Register pending segment upgrade only on those tasks to which the segment is associated.
kfaraz added a commit that referenced this pull request Apr 30, 2024
Changes:
1) Check for handoff of upgraded realtime segments.
2) Drop sink only when all associated realtime segments have been abandoned.
3) Delete pending segments upon commit to prevent unnecessary upgrades and
partition space exhaustion when a concurrent replace happens. This also prevents
potential data duplication.
4) Register pending segment upgrade only on those tasks to which the segment is associated.

Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

Labels

Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

Projects

None yet

Development

4 participants