Do not kill segments with referenced load specs from deep storage #16667
Conversation
kfaraz
left a comment
@AmatyaAvadhanula , thanks for addressing this issue!
While updating the SegmentNukeAction does make the change very simple, it is semantically incorrect. Any task action should just be expected to return the result of the performed action. It is very difficult to reason about a segment delete operation returning "root segment IDs that are unreferenced".
Ideally, we would want the kill task to do something like this:
1. In every iteration of the `for` loop inside the kill task, identify the segments to be nuked.
2. (new task action) Determine the root segment IDs of the segments to be nuked.
3. Nuke the segments.
4. (new task action) Determine which root segment IDs are still referenced.
5. Filter out the results and kill from deep storage only the load specs of the unreferenced IDs.
IIUC, the current set of changes is trying to perform steps 2, 3, and 4 in the same action, which would lead to confusion and sub-optimal performance.
Step 2 need not be a separate action. We can update `RetrieveUnusedSegmentsAction` to return a new `DataSegmentWithRootId` class, which has all the fields of the `DataSegment` class along with a nullable field `rootSegmentId`. (We could even just add the new field to the `DataSegment` class. I am okay with that too, as it would be null all the time except here.)
Step 4 has to be a separate action, though.
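The flow above can be sketched as a single pure filtering function. This is a minimal illustration with hypothetical class and method names; in practice the kill task would gather these inputs from the proposed task actions:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KillFilterSketch
{
    /**
     * Given the segment IDs that were nuked, the upgraded_from mapping returned
     * by the first task action, and the root IDs that the second task action
     * reports as still referenced, return the root IDs whose load specs are
     * safe to delete from deep storage.
     */
    public static Set<String> unreferencedRootIds(
        Set<String> nukedSegmentIds,
        Map<String, String> upgradedFromSegmentIds,
        Set<String> stillReferencedRootIds
    )
    {
        final Set<String> rootIds = new HashSet<>();
        for (String id : nukedSegmentIds) {
            // A segment with no mapping is itself a root (non-upgraded) segment
            rootIds.add(upgradedFromSegmentIds.getOrDefault(id, id));
        }
        rootIds.removeAll(stillReferencedRootIds);
        return rootIds;
    }
}
```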
Rolling upgrades
When the overlord is on an older version, step 2 will return null (if we reuse `RetrieveUnusedSegmentsAction`) and step 4 will throw an exception. In this case, we just fall back to doing what we do today.
Nomenclature
Let's go with `upgraded_from_segment_id` instead, as it is already being used with pending segments. Its meaning is also clear since "upgrade" is now a core concept.
Let me know what you think.
Regarding the addition of a new column root_segment_id in the segments table, are there other use cases outside of kill tasks where this could be useful? I'm also wondering about the storage footprint this additional bookkeeping might cause at scale.
In general, do we have a sense of how often segments get upgraded? "It depends" is a fair answer :) but I'm curious if it's worth tracking this information in a separate table vs adding a new column to the existing segments table.
@kfaraz @abhishekrb19 The reason I made the change within an existing TaskAction, besides simplicity, was to ensure that segments killed from the metadata store actually get killed from deep storage.
Having multiple actions would mean that a lock revocation between the actions could cause the task to fail after nuking the segments but before determining the segments to kill from deep storage. In such a case, segments may not be cleaned up from deep storage. However, I've changed the behaviour so that one can determine the segments to be killed before they are killed, with the assumption that every segment in the batch will be killed.
@AmatyaAvadhanula , just to clarify, the new task actions are simple read actions and do not need to check if the relevant segments are covered by locks or not. They do not need to run the SELECT statements in critical sections either. So locks being revoked would have no effect on the action unless the task itself has been killed by the Overlord. But I am not sure whether the Overlord explicitly kills a task whose locks have been revoked.
@abhishekrb19 , I have tried to address your queries below.
Right now, kill task correctness and debugging are the only two usages of this column.
True, it is additional bookkeeping but it will become crucial when "concurrent append/replace" is widely adopted and made the default cluster behaviour.
This column will be populated only when a segment has actually been upgraded from another. That happens when a replace task (say, compaction) runs concurrently with an append task (say, Kafka ingestion), which will be a fairly common occurrence once "concurrent append and replace" is enabled. For all non-upgraded segments, this column value would be null. We would prefer to have this in the segments table itself.
Please let me know if this makes sense and if you have any other concerns with adding this column to the segments table.
kfaraz
left a comment
@AmatyaAvadhanula , the code seems somewhat more complicated than it needs to be, since you are trying to do two things in the new task action.
You need two separate task actions:
FindUpgradedFromSegmentIds
Input: Set of segment IDs you want to nuke (you can skip sending the whole `DataSegment` to reduce payload size).
Output: Map from input ID to its `upgraded_from_segment_id`.
Implementation:
```sql
-- In batches of 100
SELECT id, upgraded_from_segment_id FROM druid_segments WHERE id IN (%s)
```
The kill task must check whether the task action returned no mapping for a certain segment ID. In that case, just use the segment ID itself as its `upgraded_from_segment_id`.
You may even skip adding the first task action entirely by just adding this map to `RetrieveUnusedSegmentsAction` itself; the query wiring is already there, I think.
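As a rough illustration of the batching idea (a hypothetical helper, not existing Druid code), the segment IDs could be split into batches of at most 100 and rendered as named-parameter lists to substitute into the `IN (%s)` clause:

```java
import java.util.ArrayList;
import java.util.List;

public class InClauseBatcher
{
    /**
     * Split the segment IDs into batches of at most batchSize and render a
     * named-parameter placeholder list for each batch, to be substituted
     * into "... WHERE id IN (%s)".
     */
    public static List<String> batchedInClauses(List<String> segmentIds, int batchSize)
    {
        final List<String> clauses = new ArrayList<>();
        for (int i = 0; i < segmentIds.size(); i += batchSize) {
            final List<String> batch =
                segmentIds.subList(i, Math.min(i + batchSize, segmentIds.size()));
            final StringBuilder placeholders = new StringBuilder();
            for (int j = 0; j < batch.size(); j++) {
                // e.g. ":id0, :id1" for a batch of two
                placeholders.append(j == 0 ? ":id" + j : ", :id" + j);
            }
            clauses.add(placeholders.toString());
        }
        return clauses;
    }
}
```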
Perform the `SegmentNukeAction`, and then perform the following action:
FindUpgradedToSegmentIds
Input: Set of `upgraded_from_segment_id` values (plus the segment's own ID where `upgraded_from_segment_id` was null) determined in the previous step.
Output: Map from input ID to the set of its children.
Implementation:
```sql
-- In batches of 100; add a condition on datasource too if the index uses it
SELECT id, upgraded_from_segment_id FROM druid_segments WHERE upgraded_from_segment_id IN (%s)
```
Now filter out the segment IDs which still have any child and delete the deep storage files for the rest.
kfaraz
left a comment
Thanks for the updated changes, @AmatyaAvadhanula . Left a few more suggestions.
abhishekrb19
left a comment
This column represents the core nature of the segment rather than an additional piece of info. It can be thought of as a proxy for the complete load spec, which is part of the segment payload itself.
Keeping it in the same table means we don't need to fire another statement in the transaction when we are trying to read or write this value. The value is always written at the time an upgraded segment is INSERTed and is typically read with the rest of the segment columns.
Okay, makes sense. The access patterns and ease of usage seem to justify having the column in the segments table itself. I see the column is also part of the pending segments table.
Thanks for the clarifications! @kfaraz @AmatyaAvadhanula
```java
import java.util.Objects;

/**
 * Information about the id and upgraded from segment id created as a result of segment upgrades
 */
```
"ID and upgraded from segment ID" -- for the former, can we mention child segment ID or something to disambiguate the multiple IDs?
```java
// Kill segments from the deep storage only if their load specs are not being used by any used segments
final List<DataSegment> segmentsToBeKilled = unusedSegments
// Kill segments from the deep storage only if their load specs are not being used by any other segments
// We still need to check with used load specs as segment upgrades were introduced before this change
```
Given that the concurrent compaction feature is still experimental, do we have the liberty to clean up firing the used segment task action and this additional check? Removing it could potentially lead to a kill correctness issue and leave dangling segments for datasources where concurrent compaction is enabled. It's essentially one less task action, but in terms of the evolution of experimental features, can we afford to break compatibility in some sense?
```java
// upgraded_from_segment_id is the first segment to which the same load spec originally belonged
// Load specs can be shared as a result of segment version upgrade
// This column is null for segments that haven't been upgraded.
columnNameTypes.put("upgraded_from_segment_id", "VARCHAR(255)");
```
For completeness, should we also add a check in validateSegmentsTable() to look for the presence of the new column upgraded_from_segment_id? When auto-table creation isn't enabled, we should let the operator know that the column and index should be added manually if it's not already there.
I think this is an existing shortcoming that applies to other tables too.
Maybe we should deal with it in a separate PR.
Keeping the conversation as unresolved so that we know that is to be addressed later.
@AmatyaAvadhanula , let's finalize the following structure for the responses of the task actions.
```java
class UpgradedFromSegmentsResponse
{
  // Map from a segment ID to the segment ID from which it was upgraded.
  // There should be no entry in the map for an original non-upgraded segment.
  private final Map<String, String> upgradedFromSegmentIds;
}

class UpgradedToSegmentsResponse
{
  // Map from a segment ID to a set containing:
  // 1. all segment IDs that were upgraded from it AND are still present in the metadata store
  // 2. the segment ID itself if and only if it is a non-upgraded segment AND is still present in the metadata store
  private final Map<String, Set<String>> upgradedToSegmentIds;
}
```
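A small sketch (hypothetical names, with an in-memory map standing in for the SQL query) of how the `upgradedToSegmentIds` map could be derived from segments-table rows according to those two rules:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class UpgradedToMapSketch
{
    /**
     * rows maps each segment ID still present in the metadata store to its
     * upgraded_from_segment_id (null for a non-upgraded segment).
     */
    public static Map<String, Set<String>> upgradedTo(Set<String> inputIds, Map<String, String> rows)
    {
        final Map<String, Set<String>> result = new HashMap<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            final String id = row.getKey();
            final String parent = row.getValue();
            if (parent != null && inputIds.contains(parent)) {
                // Rule 1: a surviving segment upgraded from an input ID
                result.computeIfAbsent(parent, k -> new HashSet<>()).add(id);
            } else if (parent == null && inputIds.contains(id)) {
                // Rule 2: a surviving non-upgraded segment that is itself an input ID
                result.computeIfAbsent(id, k -> new HashSet<>()).add(id);
            }
        }
        return result;
    }
}
```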
kfaraz
left a comment
Thanks for the update and for adding the tests, @AmatyaAvadhanula !
Left some comments.
kfaraz
left a comment
Sorry for the back and forth, @AmatyaAvadhanula .
But the PR needs to be refined a little.
kfaraz
left a comment
Thanks for your patience through this review, @AmatyaAvadhanula !
Changes look good, minor tweaks can be done later.
```java
final Map<String, String> upgradedFromSegmentIds = new HashMap<>();
connector.retryWithHandle(
    handle -> {
      Query<Map<String, Object>> query = handle.createQuery(sql)
```
Follow up work:
This query should probably be executed on batches of 100 segment IDs at a time.
Isn't this controlled by the batch size of kill tasks to some extent?
Yes, it would be controlled by the batch size of kill. But it is still possible for someone to either increase those limits or just fire this action with a large set of segment IDs. The overlord side should have its own safeguards.
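As a sketch of such an overlord-side safeguard (a hypothetical helper, not existing Druid code), incoming IDs could be partitioned before being bound into any statement, regardless of the batch size the kill task chose:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class OverlordBatchingSketch
{
    private static final int MAX_BATCH_SIZE = 100;

    /**
     * Partition an arbitrarily large collection of segment IDs into query
     * batches so that no single statement binds more than MAX_BATCH_SIZE IDs.
     */
    public static <T> List<List<T>> partition(Collection<T> ids)
    {
        final List<T> all = new ArrayList<>(ids);
        final List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < all.size(); i += MAX_BATCH_SIZE) {
            batches.add(new ArrayList<>(all.subList(i, Math.min(i + MAX_BATCH_SIZE, all.size()))));
        }
        return batches;
    }
}
```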
| .forEach(id -> upgradedToSegmentIds.computeIfAbsent(id, k -> new HashSet<>()).add(id)); | ||
| connector.retryWithHandle( | ||
| handle -> { | ||
| Query<Map<String, Object>> query = handle.createQuery(sql) |
Follow up work:
This one should probably be broken up in batches too.
Thank you for the reviews, @kfaraz and @abhishekrb19!
Do not kill segments with referenced load specs from deep storage (apache#16667)
@kfaraz this came up in the review of the release notes for Druid 31. Can you please write a user-facing release note for this change? How does the change affect users? Thanks.
Description
Druid now allows upgrades of segments to higher versions with the same load specs. This was introduced as part of concurrent append and replace and may have other uses.
When a load spec is shared by several segments, it is important that the deep storage location is killed only after all the metadata references to it are nuked. This is currently handled only for used segments.
Changes
This PR ensures that every used and unused segment reference is considered before killing files on deep storage.
Add a new column `upgraded_from_segment_id` to the `druid_segments` table in the metadata store, which indicates the root ancestor to which the load spec originally belonged.
Modify `SegmentTransactionalAppendAction` and `SegmentTransactionalReplaceAction` to populate the `upgraded_from_segment_id`, and add tests.
Release note