Associate pending segments with the tasks that requested them#16144
kfaraz merged 28 commits into apache:master from
Conversation
| ) | ||
| ) | ||
| ); | ||
| alterPendingSegmentsTableAddParentIdAndTaskGroup(tableName); |
It'd be better to move this function out to createPendingSegmentsTable() after the call to this method.
| log.info("Table[%s] already has column[task_group].", tableName); | ||
| } else { | ||
| log.info("Adding column[task_group] to table[%s].", tableName); | ||
| statements.add(StringUtils.format("ALTER TABLE %1$s ADD COLUMN task_group VARCHAR(255)", tableName)); |
Similar to validateSegmentsTable(), do we also need validation that the pending segments table is upgraded to the desired schema?
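A rough sketch of what such a check could look like (the method name `findMissingColumns` and the required-column list are assumptions, modeled on the `validateSegmentsTable()` pattern referenced above, not Druid's actual code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PendingSegmentsSchemaCheck
{
  // Hypothetical list of columns the upgraded pending segments schema should have.
  private static final List<String> REQUIRED_COLUMNS = Arrays.asList(
      "id", "sequence_name", "sequence_prev_id", "upgraded_from_segment_id", "task_allocator_id"
  );

  /**
   * Returns the required columns missing from the actual table schema.
   * An empty result means the table has been upgraded to the desired schema.
   */
  public static Set<String> findMissingColumns(Set<String> actualColumns)
  {
    Set<String> missing = new TreeSet<>(REQUIRED_COLUMNS);
    missing.removeAll(actualColumns);
    return missing;
  }

  public static void main(String[] args)
  {
    Set<String> legacySchema = new TreeSet<>(Arrays.asList("id", "sequence_name", "sequence_prev_id"));
    // A pre-upgrade table is missing both new columns.
    System.out.println(findMissingColumns(legacySchema));
  }
}
```

The actual validation would read the column set via the metadata-store connector; only the set-difference logic is shown here.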
kfaraz
left a comment
Thanks a lot for the changes, @AmatyaAvadhanula !
I have tried to spend some more time with the approach and I agree with you that this seems to be the best path forward.
- It simplifies the upgrade logic to some extent.
- It offers the opportunity to actively clean up pending segments once they are not needed anymore.
- It probably also safeguards against OL crashes. In the current impl (before this PR), the OL going down would probably cause pending segment to upgraded pending segment mappings to get lost as they are not clearly persisted anywhere. The `prev_segment_id` might have that info, but it is vague since that column has a very broad definition at this point.
Suggestions
These are my main suggestions:
- Rename `parent_id` to `upgraded_from_segment_id`. That clarifies the exact meaning and purpose of this column. This column should be null for an original (non-upgraded) pending segment.
- Rename `task_group` to something more distinct. We could stick to `base_sequence_name` because afaict, this value is always going to be the same as the `baseSequenceName`. Otherwise, how about we call it `task_allocator_id`?
- Leave out the cleanup logic in the `TaskLockbox`; we can do that later.
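For illustration, the suggested renames would make the column-upgrade DDL look roughly like this (the `VARCHAR(255)` width mirrors the `task_group` column in the diff above and is an assumption; plain `String.format` stands in for Druid's `StringUtils.format`):

```java
import java.util.ArrayList;
import java.util.List;

public class PendingSegmentsDdl
{
  /**
   * Builds the ALTER statements adding the two columns under their
   * suggested names. Sketch only; the real upgrade path also has to
   * skip columns that already exist.
   */
  public static List<String> upgradeStatements(String tableName)
  {
    List<String> statements = new ArrayList<>();
    statements.add(String.format("ALTER TABLE %1$s ADD COLUMN upgraded_from_segment_id VARCHAR(255)", tableName));
    statements.add(String.format("ALTER TABLE %1$s ADD COLUMN task_allocator_id VARCHAR(255)", tableName));
    return statements;
  }

  public static void main(String[] args)
  {
    upgradeStatements("druid_pendingSegments").forEach(System.out::println);
  }
}
```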
| @@ -135,14 +135,16 @@ public SegmentPublishResult perform(Task task, TaskActionToolbox toolbox) | |||
| if (startMetadata == null) { | |||
| publishAction = () -> toolbox.getIndexerMetadataStorageCoordinator().commitAppendSegments( | |||
| segments, | |||
At some point, the plan was to have one action for just committing segments and another action for committing segments and metadata both. So we decided to keep a method that would just commit segments.
But we eventually decided against having the two actions as it didn't really serve a lot of purpose. So now we could simplify the IndexerMetadataStorageCoordinator interface too.
| task.getId() | ||
| ); | ||
| } | ||
| final String pendingSegmentGroup = task.getPendingSegmentGroup(); |
I would advise keeping the cleanup logic in a separate PR. We will be able to focus and test on it better.
| ); | ||
| } | ||
|
|
||
| protected void setSupervisorManager(SupervisorManager supervisorManager) |
Rather than this, have a separate createTaskActionToolbox method that accepts a SupervisorManager.
| ); | ||
| } | ||
|
|
||
| public Map<SegmentId, SegmentId> getAnnouncedSegmentsToParentSegments() |
Code scanning notice (CodeQL): Exposing internal representation
| @SuppressWarnings("UnstableApiUsage") | ||
| public String computeSequenceNamePrevIdSha1(boolean skipSegmentLineageCheck) | ||
| { | ||
| final Hasher hasher = Hashing.sha1().newHasher() |
Code scanning notice (CodeQL): Deprecated method or constructor invocation
kfaraz
left a comment
A few more comments. Yet to review the cleanup logic in TaskLockbox, will try to get it done soon.
| pendingSegment.getId().asSegmentId().toString(), | ||
| pendingSegment.getId() | ||
| )); | ||
| Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> upgradedPendingSegments = new HashMap<>(); |
+1, rename to segmentToParent (we can omit the prefix pending as this method only deals with pending segments, and it can be taken for granted.)
| // this set should be accessed under the giant lock. | ||
| private final Set<String> activeTasks = new HashSet<>(); | ||
|
|
||
| // Stores map of pending task group of tasks to the set of their ids. |
| * Pseudo code (for a single interval): | ||
| * For an append lock held over an interval: | ||
| * transaction { | ||
| * commit input segments contained within interval | ||
| * if there is an active replace lock over the interval: | ||
| * add an entry for the inputSegment corresponding to the replace lock's task in the upgradeSegments table | ||
| * fetch pending segments with parent contained within the input segments, and commit them | ||
| * } |
It doesn't seem appropriate to have the implementation described as pseudo-code. Someone might as well read the code. It is better to briefly describe the key points of the implementation in a list fashion. (This is not a blocker for the PR).
| * your task for the segment intervals. | ||
| * | ||
| * <pre> | ||
| * Pseudo code (for a single interval) |
Same comment regarding pseudo-code.
| * An interface to be implemented by every appending task that allocates pending segments. | ||
| */ |
| * An interface to be implemented by every appending task that allocates pending segments. | |
| */ | |
| * An append task that can allocate pending segments. All concrete {@link Task} implementations that need to allocate pending segments must implement this interface. | |
| */ |
kfaraz
left a comment
Reviewed cleanup logic, flow looks okay. Left some comments. Will take another final pass after all the existing comments are addressed.
| // this set should be accessed under the giant lock. | ||
| private final Set<String> activeTasks = new HashSet<>(); | ||
|
|
||
| // Stores map of pending task group of tasks to the set of their ids. |
Please address this and also rename this map to `activeAllocatorIdToTaskIds`. You need not declare it as a `HashMap`; just a `Map` would suffice.
| * <li> id -> id (Unique identifier for pending segment) <li/> | ||
| * <li> sequence_name -> sequenceName (sequence name used for segment allocation) <li/> | ||
| * <li> sequence_prev_id -> sequencePrevId (previous segment id used for segment allocation) <li/> | ||
| * <li> upgraded_from_segment_id -> upgradedFromSegmentId (Id of the root segment from which this was upgraded) <li/> | ||
| * <li> task_allocator_id -> taskAllocatorId (Associates a task / task group / replica group with the pending segment) <li/> |
This list is not needed here, the description of the fields should be in the javadocs of the respective getters. (not a blocker for this PR).
kfaraz
left a comment
There are leftover comments that can be addressed in a follow up.
|
|
||
| @JsonTypeName(MSQWorkerTask.TYPE) | ||
| public class MSQWorkerTask extends AbstractTask | ||
| public class MSQWorkerTask extends AbstractTask implements PendingSegmentAllocatingTask |
This class need not implement PendingSegmentAllocatingTask as it never actually does any allocation. The allocation is always done by the controller task.
this can be addressed in a follow up PR.
| Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> segmentToParent = new HashMap<>(); | ||
| pendingSegments.forEach(pendingSegment -> { | ||
| if (pendingSegment.getUpgradedFromSegmentId() != null | ||
| && !pendingSegment.getUpgradedFromSegmentId().equals(pendingSegment.getId().asSegmentId().toString())) { |
Can the upgradedFromSegmentId ever be equal to the id itself?
A normal/root pending segment (i.e. one created by allocation and not upgrade) would have upgraded_from_segment_id as null, right?
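If that invariant holds, the self-equality check in the quoted diff is redundant and a null check alone suffices. A minimal sketch of the simplified mapping, using a stand-in record rather than Druid's actual pending segment class:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class UpgradedSegmentMapping
{
  /** Illustrative stand-in for a pending segment record. */
  static class PendingSegment
  {
    final String id;
    final String upgradedFromSegmentId; // null for a root (non-upgraded) pending segment

    PendingSegment(String id, String upgradedFromSegmentId)
    {
      this.id = id;
      this.upgradedFromSegmentId = upgradedFromSegmentId;
    }
  }

  /**
   * Maps each upgraded pending segment to its root segment id.
   * Root segments are skipped purely via the null check, with no
   * comparison of a segment's id against itself.
   */
  static Map<String, String> segmentToParent(Iterable<PendingSegment> pendingSegments)
  {
    Map<String, String> segmentToParent = new HashMap<>();
    for (PendingSegment segment : pendingSegments) {
      if (segment.upgradedFromSegmentId != null) {
        segmentToParent.put(segment.id, segment.upgradedFromSegmentId);
      }
    }
    return segmentToParent;
  }

  public static void main(String[] args)
  {
    Map<String, String> mapping = segmentToParent(Arrays.asList(
        new PendingSegment("root_v1", null),
        new PendingSegment("upgraded_v2", "root_v1")
    ));
    System.out.println(mapping); // {upgraded_v2=root_v1}
  }
}
```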
| "Upgraded [%d] pending segments for REPLACE task[%s]: [%s]", | ||
| upgradedPendingSegments.size(), task.getId(), upgradedPendingSegments | ||
| ); | ||
| final Set<ReplaceTaskLock> replaceLocksForTask = toolbox |
Some comments here would be helpful.
| )); | ||
| Map<SegmentIdWithShardSpec, SegmentIdWithShardSpec> segmentToParent = new HashMap<>(); | ||
| pendingSegments.forEach(pendingSegment -> { | ||
| if (pendingSegment.getUpgradedFromSegmentId() != null |
Should we only look at pending segments that were upgraded by this task rather than all upgraded pending segments?
| /** | ||
| */ | ||
| public class NoopTask extends AbstractTask | ||
| public class NoopTask extends AbstractTask implements PendingSegmentAllocatingTask |
Does NoopTask need to implement the new interface for the purpose of tests?
| * @param taskAllocatorId task id / task group / replica group for an appending task | ||
| * @return number of pending segments deleted from the metadata store | ||
| */ | ||
| int deletePendingSegmentsForTaskGroup(String taskAllocatorId); |
Method needs to be renamed.
| } | ||
| catch (Exception e) { | ||
| log.error(e, "PendingSegment[%s] mapping update request to version[%s] on Supervisor[%s] failed", | ||
| log.error(e, "PendingSegmentRecord[%s] mapping update request to version[%s] on Supervisor[%s] failed", |
| // always insert empty previous sequence id | ||
| insertPendingSegmentIntoMetastore(handle, newIdentifier, dataSource, interval, "", sequenceName, sequenceNamePrevIdSha1); | ||
| insertPendingSegmentIntoMetastore(handle, newIdentifier, dataSource, interval, "", sequenceName, sequenceNamePrevIdSha1, | ||
| taskAllocatorId |
This PR aims to associate pending segments with the task groups that created them.
Motivation:
Pending segment cleanup helps delete entries immediately after tasks exit and can alleviate the load on the metadata store during segment allocations. It can also help with segment allocation failures caused by conflicting pending segments that are no longer needed.
A) When a concurrent replace occurs, it can upgrade any pending segments for concurrent append tasks to the replacing task's lock version.
B) When an appending task commits segments, it can not only commit the pending segments that it directly created but also their upgraded versions.
Previously, upgraded pending segments created by replace tasks could have different ids from the ones eventually committed by the appending task. This doesn't affect batch appends.
However for concurrent streaming ingestion, there could be a race where an upgraded pending segment on the indexer would correspond to the same root segment as the parent of a different upgraded segment committed by the job.
In such a case, there could be a brief period when the upgraded realtime segment is being served, while the committed segment with a different id is also being served on a historical. Since their ids are distinct, the broker would allow both their results to be merged, leading to data duplication in queries.
The change in protocol ensures that an append action upgrades a segment set which corresponds exactly to the pending segment upgrades made by the concurrent replace action, and eliminates any duplication in query results that may occur due to the above race.
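The protocol change can be sketched as a selection over the persisted `upgraded_from_segment_id` links: the append commit picks up exactly those upgraded pending segments whose root is among its own committed segments. A hedged sketch with illustrative names, not Druid's actual API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AppendCommitUpgrade
{
  /**
   * Selects the upgraded pending-segment ids whose root segment is among
   * the segments this append task is committing. Committing exactly this
   * set keeps realtime and historical segment ids consistent, so the
   * broker never merges two versions of the same root segment.
   */
  static Set<String> upgradedIdsToCommit(
      Set<String> committedSegmentIds,
      Map<String, String> pendingIdToUpgradedFrom
  )
  {
    Set<String> result = new TreeSet<>();
    for (Map.Entry<String, String> entry : pendingIdToUpgradedFrom.entrySet()) {
      if (entry.getValue() != null && committedSegmentIds.contains(entry.getValue())) {
        result.add(entry.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args)
  {
    Map<String, String> pending = new HashMap<>();
    pending.put("seg_A_v2", "seg_A"); // upgraded by a concurrent replace
    pending.put("seg_B_v2", "seg_B"); // root not committed by this task
    System.out.println(upgradedIdsToCommit(new HashSet<>(Collections.singleton("seg_A")), pending));
    // prints [seg_A_v2]
  }
}
```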
Implementation details:
This PR has: