Avoid deletion of load/drop entry from CuratorLoadQueuePeon in case of load timeout #10213
jihoonson merged 6 commits into apache:master from
Conversation
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This pull request/issue is no longer marked as stale.
clintropolis left a comment
this seems like a useful change; I tested it out and it seems to alleviate the issue described in #10193 (comment).
@a2l007 any chance you can fix up conflicts?
```diff
 try {
   if (curator.checkExists().forPath(path) != null) {
-    failAssign(segmentHolder, new ISE("%s was never removed! Failing this operation!", path));
+    failAssign(segmentHolder, true, new ISE("%s was never removed! Failing this operation!", path));
```
I think it would be worth clarifying this log message to indicate that, for load operations, while the coordinator has given up, the historical might still process and load the requested segments. Maybe something like "Load segments operation timed out, %s was never removed! Abandoning attempt, (but these segments might still be loaded)". I guess it would need to adjust the message based on whether it was a load or a drop.
I've modified the message here. Please let me know if this works.
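To make the load/drop distinction concrete, here is a minimal sketch of a message that varies by operation type. The `TimeoutMessage` class, the `Action` enum, and the `forTimeout` helper are hypothetical illustrations, not Druid's actual API or the exact wording merged in this PR.

```java
// Hypothetical sketch: vary the timeout message by operation type.
// TimeoutMessage, Action, and forTimeout are illustrative names, not part of Druid.
public class TimeoutMessage
{
  public enum Action { LOAD, DROP }

  public static String forTimeout(Action action, String path)
  {
    if (action == Action.LOAD) {
      // The coordinator gave up waiting, but the historical may still
      // process the queue node and load the segment.
      return "Load queue node [" + path + "] was never removed! "
             + "Abandoning the attempt, but the segments might still be loaded by the historical.";
    }
    return "Drop queue node [" + path + "] was never removed! Abandoning the attempt.";
  }
}
```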
```java
private void failAssign(SegmentHolder segmentHolder, boolean handleTimeout, Exception e)
{
  if (e != null) {
    log.error(e, "Server[%s], throwable caught when submitting [%s].", basePath, segmentHolder);
```
I'm not sure why we don't emit exceptions currently (using EmittingLogger.makeAlert()), but should we? At least for the segment loading timeout error, it would be nice to emit those errors so that cluster operators can notice there is something going wrong with segment loading.
Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, a slowness on the historical side can generate a large number of alerts for a fairly large cluster. What do you think?
Also as a followup PR I was planning to add the timedOut segment list to the /druid/coordinator/v1/loadqueue along with some docs about its usage in understanding the cluster behavior.
> Alerting sounds like a good idea, but my concern is that since the alert would happen per segment, a slowness on the historical side can generate a large number of alerts for a fairly large cluster. What do you think?
I think it's a valid concern. We may be able to emit those exceptions in bulk if they are thrown within a short time frame. I believe this should be done in a separate PR even if we want it, and thus my comment is not a blocker for this PR.
> Also as a followup PR I was planning to add the timedOut segment list to the /druid/coordinator/v1/loadqueue along with some docs about its usage in understanding the cluster behavior.
Thanks. It sounds good to me.
```diff
-    loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
+    // Timed out segments need to be replicated in another server for faster availability
+    if (!serverHolder.getPeon().getTimedOutSegments().contains(segment)) {
+      loadingSegments.put(segment.getId(), server.getTier(), numReplicants + 1);
```
loadingSegments is not just a set of segments loading anymore. Please add some javadoc in SegmentReplicantLookup about this.
As @himanshug pointed out in #10193 (comment), there could be two types of slow segment loading.
- There are a few historicals being slow in segment loading in the cluster. This can be caused by unbalanced load queues or some intermittent failures.
- Historicals are OK, but ingestion might outpace the ability to load segments.
This particular change in SegmentReplicantLookup could help in the former case, but make things worse in the latter case. In an extreme case, all historicals could have the same set of timed-out segments in their load queues. This might still be OK, though, because if that's the case, Druid cannot get out of that state by itself anyway. The system administrator should add more historicals or use more threads for parallel segment loading. However, we should provide relevant data so that system administrators can tell what's happening. I left another comment about emitting exceptions to provide such data.
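The idea under discussion, not counting timed-out queue entries toward a segment's replicant total so the coordinator can schedule another replica elsewhere, can be sketched with a simplified standalone model. The `ReplicantCounter` class and its method names are hypothetical; the real logic lives in Druid's SegmentReplicantLookup and the peon's timed-out segment set.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model: count queued loads per segment, but skip entries whose
// load already timed out so the segment appears under-replicated and can be
// assigned to another historical.
public class ReplicantCounter
{
  private final Map<String, Integer> loadingSegments = new HashMap<>();
  private final Set<String> timedOutSegments = new HashSet<>();

  public void markTimedOut(String segmentId)
  {
    timedOutSegments.add(segmentId);
  }

  public void recordQueuedLoad(String segmentId)
  {
    // Timed-out segments are not counted toward the loading replicant total.
    if (!timedOutSegments.contains(segmentId)) {
      loadingSegments.merge(segmentId, 1, Integer::sum);
    }
  }

  public int getLoadingReplicants(String segmentId)
  {
    return loadingSegments.getOrDefault(segmentId, 0);
  }
}
```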
@jihoonson @himanshug Would it make sense to make the replication behavior user-configurable? We could have a dynamic config like replicateAfterLoadTimeout, which would control whether segments are replicated to a different historical when a load times out on the current one. The default could be true, but a cluster operator can set this to false if they wish to avoid the additional churn and know the historicals are OK and will eventually load the segments.
Adding a config seems reasonable to me 👍
It sounds good to me too.
Added a config. I've set replicateAfterLoadTimeout to false as the default; I feel it's better to preserve the existing behaviour, and admins should be aware of this property's behavior before setting it to true. Let me know what you think.
It sounds good to me to preserve the existing behavior by default.
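As a sketch of how such a flag could gate the counting decision: when the flag is off (the default in this PR), timed-out entries keep counting as pending loads, preserving the pre-existing behavior; when it is on, they stop counting, freeing the coordinator to replicate elsewhere. The `CoordinatorSettings` class and `countsAsLoading` method are hypothetical names for illustration, not Druid's actual coordinator dynamic config API.

```java
// Hypothetical sketch of gating timed-out-segment replication behind a flag
// modeled on the replicateAfterLoadTimeout dynamic config (default false).
public class CoordinatorSettings
{
  private final boolean replicateAfterLoadTimeout;

  public CoordinatorSettings(boolean replicateAfterLoadTimeout)
  {
    this.replicateAfterLoadTimeout = replicateAfterLoadTimeout;
  }

  // Decide whether a queued load should count as a pending replicant.
  // With the flag off, even timed-out entries count (existing behavior);
  // with it on, timed-out entries are excluded so another replica is scheduled.
  public boolean countsAsLoading(boolean timedOut)
  {
    return !timedOut || !replicateAfterLoadTimeout;
  }
}
```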
Fixes #10193.
CuratorLoadQueuePeon no longer deletes segment load/drop entries when druid.coordinator.load.timeout expires. Deleting these entries after a timeout can cause the balancer to work incorrectly, as described in the linked issue. With this fix, the segment entries will remain in the load/drop queue for a peon until the ZK entry is deleted by the historical, unless a non-timeout related exception occurs. This helps the balancer account for the actual queue size for historicals and can lead to better balancing decisions.
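The changed queue behavior described above can be modeled with a small standalone sketch: a load timeout marks the entry as timed out but keeps it queued, and only a non-timeout failure or the historical deleting the ZK node removes it. The `LoadQueueModel` class and its method names are illustrative only, not the real CuratorLoadQueuePeon.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of the fixed peon behavior: timeouts keep entries queued so the
// balancer sees the true queue size; removal happens only on a non-timeout
// failure or when the historical deletes the ZK entry.
public class LoadQueueModel
{
  private final Set<String> queuedSegments = new LinkedHashSet<>();
  private final Set<String> timedOutSegments = new LinkedHashSet<>();

  public void enqueue(String segmentId)
  {
    queuedSegments.add(segmentId);
  }

  public void onLoadTimeout(String segmentId)
  {
    // Keep the entry; just track that it timed out.
    timedOutSegments.add(segmentId);
  }

  public void onNonTimeoutFailure(String segmentId)
  {
    queuedSegments.remove(segmentId);
    timedOutSegments.remove(segmentId);
  }

  public void onZkEntryDeleted(String segmentId)
  {
    queuedSegments.remove(segmentId);
    timedOutSegments.remove(segmentId);
  }

  public int queueSize()
  {
    return queuedSegments.size();
  }
}
```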