Remove legacy code from LogUsedSegments duty#10287
Conversation
Edit: This is resolved.
I recently became a committer, but haven't gotten repo access set up yet (if another committer reads this and is willing to guide me on this process, I'm all ears!), so I cannot complete all of the committer steps. For labels, I would add:
| |`druid.coordinator.loadqueuepeon.repeatDelay`|The start and repeat delay for the loadqueuepeon, which manages the load and drop of segments.|PT0.050S (50 ms)| | ||
| |`druid.coordinator.asOverlord.enabled`|Boolean value for whether this Coordinator process should act like an Overlord as well. This configuration allows users to simplify a druid cluster by not having to deploy any standalone Overlord processes. If set to true, then Overlord console is available at `http://coordinator-host:port/console.html` and be sure to set `druid.coordinator.asOverlord.overlordService` also. See next.|false| | ||
| |`druid.coordinator.asOverlord.overlordService`| Required, if `druid.coordinator.asOverlord.enabled` is `true`. This must be same value as `druid.service` on standalone Overlord processes and `druid.selectors.indexing.serviceName` on Middle Managers.|NULL| | ||
| |`druid.coordinator.duties.logUsedSegments.enabled`|Boolean value for whether or not the coordinator should execute the `LogUsedSegments` Duty|true| |
We could make the property description friendlier by describing what LogUsedSegments does instead of just mentioning it. Also, it would be helpful to add a line or two about when a cluster operator should think about disabling this property.
+1 to describe more about the behavior of LogUsedSegments duty. Also I prefer to remove the duties from the config item name - druid.coordinator.logUsedSegments.enabled. duties is a term in the code not in the user facing documentation, and before it was called helpers in the code.
@a2l007 @ArvinZheng thanks for the thoughts. I agree that if duty isn't a user facing word, it should be removed. I also did my best to improve the documentation for end users to better understand what they are disabling if they choose to do so.
| |`druid.coordinator.loadqueuepeon.repeatDelay`|The start and repeat delay for the loadqueuepeon, which manages the load and drop of segments.|PT0.050S (50 ms)| | ||
| |`druid.coordinator.asOverlord.enabled`|Boolean value for whether this Coordinator process should act like an Overlord as well. This configuration allows users to simplify a druid cluster by not having to deploy any standalone Overlord processes. If set to true, then Overlord console is available at `http://coordinator-host:port/console.html` and be sure to set `druid.coordinator.asOverlord.overlordService` also. See next.|false| | ||
| |`druid.coordinator.asOverlord.overlordService`| Required, if `druid.coordinator.asOverlord.enabled` is `true`. This must be same value as `druid.service` on standalone Overlord processes and `druid.selectors.indexing.serviceName` on Middle Managers.|NULL| | ||
| |`druid.coordinator.logUsedSegments.enabled`|Boolean value for whether or not the coordinator should execute the `LogUsedSegments` portion of its coordination work. `LogUsedSegments` is an informational job run by the Coordinator every coordination cycle. It gets a snapshot of segments in the cluster and iterates them. While iterating, it will emit an alert if a segment has a size less than 0. If debug logging is enabled, it will also log a string representation of each segment. Lastly, it logs then number of segments in the cluster. An admin can decide that forgoing this work may advantageous if they don't need any of the information provided.|true| |
typo: it logs then number of > it logs the number of
not a comment, just wanted to see if the following sounds good?
Boolean value for whether or not the coordinator should execute the LogUsedSegments portion of its coordination work. LogUsedSegments is an informational job run by the Coordinator every coordination cycle which logs every segment at DEBUG level and the total number of used segments at INFO level. In addition, it emits an alert if a segment has a size less than 0. An admin can decide that forgoing this work may be advantageous if they don't need any of the information provided.
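To make the wording above concrete, the duty's work can be sketched roughly as follows. This is a hypothetical stand-in, not the actual Druid source: `Segment`, the snapshot, and the logger are all simplified here.

```java
import java.util.Arrays;
import java.util.List;

public class LogUsedSegmentsSketch
{
  // Minimal stand-in for Druid's DataSegment: just an id and a size.
  static class Segment
  {
    final String id;
    final long size;

    Segment(String id, long size)
    {
      this.id = id;
      this.size = size;
    }
  }

  /**
   * Mirrors the described behavior: per-segment DEBUG logging, an alert for
   * any segment with size < 0, and an INFO count at the end.
   * Returns the number of alerts that would be emitted.
   */
  static int run(List<Segment> usedSegments, boolean debugEnabled)
  {
    int alerts = 0;
    for (Segment segment : usedSegments) {
      if (debugEnabled) {
        // In Druid this would be a log.debug of the segment's string form.
        System.out.println("DEBUG segment: " + segment.id);
      }
      if (segment.size < 0) {
        // In Druid this would be log.makeAlert("No size on a segment").emit().
        alerts++;
      }
    }
    System.out.printf("INFO Found [%,d] used segments.%n", usedSegments.size());
    return alerts;
  }

  public static void main(String[] args)
  {
    List<Segment> segments = Arrays.asList(new Segment("a", 100), new Segment("b", -1));
    System.out.println("alerts=" + run(segments, false));
  }
}
```

The per-segment DEBUG pass is the only part whose cost scales with cluster size when debug logging is on; the INFO count is a single cheap call.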
I like your wording, Arvin. made the changes.
Slightly surprised by the ~10 sec number. I am assuming DEBUG is already not enabled, so 10 secs are going in just iterating over the segments and checking the size to print the alert. I wonder if the slowness comes from the use of streams in the iteration utility. Can you try to change the code to iterate over segments using simply nested for loops in LogUsedSegments.java and see if it still takes that long?
I'm not sure if I will be able to test at the same scale as our prod cluster that is taking 10 seconds, but I can try in a lower env that has about 1/2 the segments of prod to see what that looks like. Will try tomorrow. Also, the javadoc calls out that this will be knowingly slower than simple iteration of a list. The use of this method is not that wide, so it is probably not drastically hurting perf across the coordinator. Still worth taking a look at, though.
@himanshug so I added a timer in my smaller environment and see that it is very fast to run LogUsedSegments as it is written today for a cluster that has ~300 datasources and ~150k used segments. My prod cluster is quite a bit larger: over 1k datasources and over 1MM segments. But I am not going to be able to add the timer there or test the nested for loop any time soon because we are in a change freeze until the new year. There does seem to be evidence of a good speedup in the smaller-scale test I did. Not sure if you think it is worth opening a separate issue/PR to address the usages of the existing stream approach. But the question is, how many clusters operate at a scale where the increased performance is worth getting rid of that nifty utility method?
@capistrant thanks for the test. This is still surprising. I did a quick benchmark (see #10604) and the iteration looks very fast (relative to ~10 sec) with both streams and for-loops, even for 1000 dataSources with 2000 segments each, i.e. 2 million segments overall.
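For illustration, the two iteration styles under discussion can be compared on a toy `dataSource -> segment sizes` map. All names here are illustrative stand-ins, not the Druid utility that #10604 actually benchmarked.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IterationStyles
{
  // Stream-based style: flatten all per-dataSource lists, then count bad sizes.
  static long countWithStreams(Map<String, List<Long>> segmentsByDataSource)
  {
    return segmentsByDataSource.values().stream()
        .flatMap(List::stream)
        .filter(size -> size < 0)
        .count();
  }

  // Plain nested for-loops: same work, no stream machinery.
  static long countWithLoops(Map<String, List<Long>> segmentsByDataSource)
  {
    long negative = 0;
    for (List<Long> segments : segmentsByDataSource.values()) {
      for (long size : segments) {
        if (size < 0) {
          negative++;
        }
      }
    }
    return negative;
  }

  public static void main(String[] args)
  {
    Map<String, List<Long>> segments = new HashMap<>();
    segments.put("ds1", Arrays.asList(10L, -1L));
    segments.put("ds2", Arrays.asList(5L));
    // Both styles visit every segment exactly once; any measured difference is
    // iterator/stream overhead, which the benchmark found small relative to ~10s.
    System.out.println(countWithStreams(segments) + " " + countWithLoops(segments));
  }
}
```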
Our estimates were from wall clock time looking at logs. But I admit it is pretty hand-wavy and glosses over some facts. EmitClusterStatsAndMetrics logs out some stuff at the end of its run. We then have our configured 30 second backoff time. Then we execute the historical management duties runnable again, and the first duty is LogUsedSegments, which logs when it finishes. So given two wall clock values such as 2020-11-25T18:05:42,18 you can say there was 11 seconds between the end of the backoff time and the completion of the first duty. But this neglects all of the stuff in DutiesRunnable#run() before we start running duties, as well as any discrepancy in the amount of time that is actually backed off between the end of one run and the next.
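A less hand-wavy measurement than differencing log timestamps is to wrap just the duty body in a timer, in the spirit of the "added a timer" test described earlier in the thread. This is a hypothetical sketch; the method and its stand-in workload are illustrative.

```java
public class DutyTimer
{
  // Times only the duty body itself, excluding backoff and runnable setup.
  static long timeMillis(Runnable duty)
  {
    long start = System.nanoTime();
    duty.run();
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args)
  {
    long elapsed = timeMillis(() -> {
      long sum = 0;
      for (int i = 0; i < 1_000_000; i++) {
        sum += i;  // stand-in for iterating the segment snapshot
      }
    });
    System.out.println("duty took " + elapsed + " ms");
  }
}
```

Wrapping the duty this way sidesteps both caveats above: the DutiesRunnable setup cost and any jitter in the backoff interval are excluded from the measurement.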
Circling back on this. My cluster with 161k segments takes ~20ms to execute the duty. In the big scheme of things, do you think this PR should be closed with the reasoning that it is not worth adding a config for such small savings? I'd like to tie up the loose end in my WIP tracking by either going forward with this or closing it out. I won't be upset if it is rejected, but I'd probably still enable this in my production environment just because we find zero value in the duty.
Real numbers reported from #10603 might make this more obvious eventually. That said, on the surface I do agree with your assessment of the utility of this duty. I would set it to disabled by default and leave the config undocumented; if, after a bunch of releases, nobody notices it, then maybe remove that code altogether.
Hmmm. If we are going to go with a disabled and undocumented implementation, I will add a design review label to make sure we get a +1 from an extra committer. I do slightly question the removal because the debug logging may actually be useful for some people.
jihoonson
left a comment
There was a problem hiding this comment.
LGTM, but please fix the documentation before merge.
| |`druid.coordinator.loadqueuepeon.repeatDelay`|The start and repeat delay for the loadqueuepeon, which manages the load and drop of segments.|PT0.050S (50 ms)| | ||
| |`druid.coordinator.asOverlord.enabled`|Boolean value for whether this Coordinator process should act like an Overlord as well. This configuration allows users to simplify a druid cluster by not having to deploy any standalone Overlord processes. If set to true, then Overlord console is available at `http://coordinator-host:port/console.html` and be sure to set `druid.coordinator.asOverlord.overlordService` also. See next.|false| | ||
| |`druid.coordinator.asOverlord.overlordService`| Required, if `druid.coordinator.asOverlord.enabled` is `true`. This must be same value as `druid.service` on standalone Overlord processes and `druid.selectors.indexing.serviceName` on Middle Managers.|NULL| | ||
| |`druid.coordinator.logUsedSegments.enabled`|Boolean value for whether or not the coordinator should execute the `LogUsedSegments` portion of its coordination work. `LogUsedSegments` is an informational job run by the Coordinator every coordination cycle which logs every segment at DEBUG level and the total number of used segments at INFO level. In addition to these logs, it emits an alert if a segment has a size less than 0. An admin can decide that forgoing this work may be advantageous if they don't need any of the information provided.| |
The default value column is missing.
Also, did you want to not document it?
Well, @himanshug had offered up the idea of not documenting it and setting it to true, to phase this duty out altogether. I'm receptive to that idea and can implement it, but a part of me wants to avoid that since the logging of each segment at debug level could actually be useful to people out in the wild.
for now, I fixed this default value. let me know what you think is best for documentation/default value.
If I had to vote, I would keep my implementation because it may actually be useful. The only thing I might change now or in the future is to default it to being off since it seems like a bit more of a debug duty.
| "Done making historical management duties %s", | ||
| duties.stream().map(duty -> duty.getClass().getName()).collect(Collectors.toList()) | ||
| ); | ||
| return duties; |
nit: Collections.unmodifiableList(duties)
reverting to the original way of doing this now that LogUsedSegments is always used again
I like this log message from LogUsedSegments and have found it very useful:
log.info("Found [%,d] used segments.", params.getUsedSegments().size());
It's nearly free, since params.getUsedSegments() is something that's just passed into the method and doesn't need to be computed. Instead of a config to disable the duty, how about removing this block and leaving everything else the same:
DataSourcesSnapshot dataSourcesSnapshot = params.getDataSourcesSnapshot();
for (DataSegment segment : dataSourcesSnapshot.iterateAllUsedSegmentsInSnapshot()) {
if (segment.getSize() < 0) {
log.makeAlert("No size on a segment")
.addData("segment", segment)
.emit();
}
}
I'm pretty sure this is ancient debugging code and we can safely get rid of it.
The other potentially expensive part is skipped if log level is INFO or higher. (Which is the default, so it shouldn't be a problem.)
You could also use (That's written with profiling queries in mind, but the technique would work just as well for profiling anything.)
I think I like this idea the most, considering everything we have learned since I created the PR. I am going to amend the PR to remove the configuration to skip the duty. Instead we will remove this legacy code and leave everything else as is. Maybe in the future this duty will change and become expensive, making the ability to skip it beneficial, but let's not put the cart before the horse and overcomplicate things today.
* allow the LogUsedSegments duty to be skipped
* Fixes for TravisCI coverage checks and documentation spell checking
* parameterize DruidCoordinatorTest in order to achieve coverage
* update config name to remove duty ref and improve documentation
* refine documentation for new config with reviewer advice
* add default column to docs for new config
* remove legacy code in LogUsedSegments and remove config to disable duty
* fix makeHistoricalMangementDuties now that the returned list is always the same
Update on 01/07/20
After more discussion, we have decided to scrap the ability to disable the duty and instead remove the legacy code that emits alerts for segments with no size. This code is thought to be old debugging code that is safe to remove. Now the LogUsedSegments duty gives a helpful log of the number of active segments no matter what, and if debug logging is enabled, used segments are logged individually. This duty has no performance concerns for non-debug clusters, meaning there is no reason to add the complexity of a new config to disable it. In the future a config may be beneficial depending on where this duty goes, but for now it seems like unnecessary code.
Update on 12/21/20
After analysis of the utility of this duty, we are now considering disabling this duty by default and not documenting the config to enable it, with the idea being that we will remove the duty altogether in the future if it proves that nobody questions it being off by default. Because of this plan, I added a Design Review label to the PR. More input on this plan would be helpful.
Description
Allows cluster admin to disable the LogUsedSegments coordinator duty if they so choose. This duty does not have any impact on successive duties so disabling it does not cause any change to the state of the cluster after coordination duties complete.
As an admin for a large enterprise cluster myself, I have been on the hunt for any optimization I can find in the lifecycle of the coordinator. This patch appears to be low-hanging fruit in my efforts. This duty takes ~10 seconds to complete on our largest cluster. We have never used debug logging to log our segments, and we have never checked the emitter's alert stream for alerts on segments with size < 0. Therefore, I think it makes sense to give admins the ability to toggle this duty on and off as they see fit.
New Config:
druid.coordinator.duties.logUsedSegments.enabled ... Default: true
I added a config to the DruidCoordinatorConfig. The naming scheme of this config can be easily changed as the community sees fit. I made my choice in the hope that it is self-explanatory and allows for more future duty-specific configs to be added following the same scheme. This config defaults to having the LogUsedSegments duty enabled (keeping in line with current functionality). When the coordinator is becoming a leader and creating its duties, it consults the config object to find out if the LogUsedSegments duty should be added to the duty lists for historical segment management and indexing.
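The consult-the-config step could look roughly like the sketch below. This is illustrative only; Druid's actual duty-list construction and duty types differ, and the duty names here are just strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class DutyListSketch
{
  /**
   * Builds the list of historical management duty names, gating
   * LogUsedSegments behind the (hypothetical) config flag that maps to
   * druid.coordinator.duties.logUsedSegments.enabled.
   */
  static List<String> makeDutyNames(boolean logUsedSegmentsEnabled)
  {
    List<String> duties = new ArrayList<>();
    if (logUsedSegmentsEnabled) {
      duties.add("LogUsedSegments");  // only added when the config is true
    }
    duties.add("RunRules");           // other duties run unconditionally
    duties.add("BalanceSegments");
    return duties;
  }

  public static void main(String[] args)
  {
    System.out.println(makeDutyNames(true));
    System.out.println(makeDutyNames(false));
  }
}
```

Because LogUsedSegments has no effect on successive duties, dropping it from the front of the list leaves the rest of the coordination cycle unchanged.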
This PR has:
Key changed/added classes in this PR
DruidCoordinatorConfig
DruidCoordinator