WINA-747 Complete solution to solve startup/shutdown race condition #25453
buraizu
left a comment
Thanks, approving with a minor rephrasing suggested for the release notes
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv create-vm --pipeline-id=33882450 --os-family=ubuntu
…54545176b.yaml Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>
  status, err = service.Control(command)
  if err != nil {
-     return fmt.Errorf("could not send control %d: %w", command, err)
+     return fmt.Errorf("could not send control %d to service %s: %w", command, serviceName, err)
❓ question
Would it make sense to retry on error? Or are errors here not recoverable?
It depends; this is a low-level function. The retry logic lives in the callers.
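To illustrate the split described in this reply, here is a minimal sketch of a low-level helper that only wraps the error and a caller-side check that decides whether a retry is worthwhile. The names sendControl, shouldRetry, errDependentsRunning and errAccessDenied are hypothetical and not taken from the PR; errDependentsRunning stands in for the Windows ERROR_DEPENDENT_SERVICES_RUNNING condition reported by ControlService.

```go
package main

import (
	"errors"
	"fmt"
)

// controlStop mirrors the numeric value of SERVICE_CONTROL_STOP on Windows.
const controlStop = 1

// Hypothetical sentinel errors used only for this sketch.
var (
	errDependentsRunning = errors.New("dependent services are running")
	errAccessDenied      = errors.New("access denied")
)

// sendControl mimics the low-level helper: it does not retry, it only
// wraps the failure with the command and service name. The %w verb keeps
// the original cause reachable for callers.
func sendControl(serviceName string, command int, control func() error) error {
	if err := control(); err != nil {
		return fmt.Errorf("could not send control %d to service %s: %w", command, serviceName, err)
	}
	return nil
}

// shouldRetry is the caller-side decision: only the "dependent services
// are running" condition is treated as recoverable in this sketch.
func shouldRetry(err error) bool {
	return errors.Is(err, errDependentsRunning)
}

func main() {
	err := sendControl("datadogagent", controlStop, func() error { return errDependentsRunning })
	fmt.Println(err)
	fmt.Println("retry:", shouldRetry(err))

	err = sendControl("datadogagent", controlStop, func() error { return errAccessDenied })
	fmt.Println(err)
	fmt.Println("retry:", shouldRetry(err))
}
```

Because the wrapped message now carries the service name, a caller that logs the error can tell which service rejected the control code while still matching on the underlying cause.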
if callback != nil {
    callback.beforeStopService(serviceName)
}
for {
❓ question
Have you considered using an out-of-the-box retry library such as backoff.Retry? I wonder if we should sleep between retries to reduce the CPU impact.
I do not think so in this case. First, the loop is capped by a timeout, currently 30 seconds, so it will be short anyway, and the ability for the restart to be "tight" is relatively important. Second, doStopService, I think, waits 300 ms on each individual service stop, externally to this loop, meaning the loop only has to "detect" the race condition when it happens and immediately try to address it.
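The shape of such a loop can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: stopUntil and errStillRunning are hypothetical names, and the attempt callback stands in for one full stop pass, which in the real code already blocks while waiting on each service.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStillRunning is a hypothetical error for a failed stop pass.
var errStillRunning = errors.New("dependent service still running")

// stopUntil calls attempt repeatedly, without sleeping in between, until
// it succeeds or the total timeout has elapsed. It returns the number of
// attempts made. The first attempt always runs, even with a zero timeout.
func stopUntil(timeout time.Duration, attempt func() error) (int, error) {
	deadline := time.Now().Add(timeout)
	for n := 1; ; n++ {
		err := attempt()
		if err == nil {
			return n, nil
		}
		if !time.Now().Before(deadline) {
			return n, fmt.Errorf("gave up after %d attempts: %w", n, err)
		}
	}
}

func main() {
	calls := 0
	n, err := stopUntil(30*time.Second, func() error {
		calls++
		if calls < 3 {
			return errStillRunning // race detected, try again right away
		}
		return nil
	})
	fmt.Println("attempts:", n, "err:", err)
}
```

The loop has no backoff of its own because each attempt is assumed to do its own waiting; if an attempt returned immediately, the loop would spin until the deadline.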
What does this PR do?
This is the second and final part of the solution, building on the partial solution from the first part (#25282). It uses the same testing framework. Its main contribution is detecting the remaining parts of the race condition, relying mostly on the fact that ControlService with the SERVICE_CONTROL_STOP code fails if any dependent service is running. Accordingly, a second or further iteration of stopping dependent services is performed when needed, unless the total timeout is exceeded.
This makes Agent restart-service robust enough to handle the race condition in which the Agent is still starting while it is concurrently being restarted via the Agent CLI command. It does not help the PowerShell Restart-Service cmdlet, which still has a non-negligible chance of failing if a dependent service is started concurrently just before the cmdlet attempts to stop the core Agent service.
In addition, a modest refactoring removes duplicate calls and reuses the more recent Go runtime svc.ListDependentServices() function instead of a handcrafted version.
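The iteration described above can be sketched with an in-memory stand-in for the service control manager. Everything here is an assumption made for illustration: fakeSCM, stopWithDependents and newRaceSCM are hypothetical, the hook only borrows its name from the beforeStopService callback visible in the diff, and a pass limit replaces the total timeout to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

var errDependentsRunning = errors.New("dependent services are running")

// fakeSCM is a hypothetical in-memory model of the service manager.
type fakeSCM struct {
	mainService       string
	dependents        []string
	running           map[string]bool
	beforeStopService func(name string) // test hook, may be nil
}

// runningDependents lists the dependent services that are currently running.
func (m *fakeSCM) runningDependents() []string {
	var out []string
	for _, d := range m.dependents {
		if m.running[d] {
			out = append(out, d)
		}
	}
	return out
}

// stop models the stop control: stopping the main service fails while
// any of its dependents is running.
func (m *fakeSCM) stop(name string) error {
	if m.beforeStopService != nil {
		m.beforeStopService(name)
	}
	if name == m.mainService && len(m.runningDependents()) > 0 {
		return errDependentsRunning
	}
	m.running[name] = false
	return nil
}

// stopWithDependents repeats "stop running dependents, then stop the main
// service" until the main service stops or maxPasses is reached.
func stopWithDependents(m *fakeSCM, maxPasses int) (int, error) {
	for pass := 1; pass <= maxPasses; pass++ {
		for _, d := range m.runningDependents() {
			if err := m.stop(d); err != nil {
				return pass, err
			}
		}
		err := m.stop(m.mainService)
		if err == nil {
			return pass, nil
		}
		if !errors.Is(err, errDependentsRunning) {
			return pass, err
		}
	}
	return maxPasses, fmt.Errorf("%s not stopped after %d passes: %w", m.mainService, maxPasses, errDependentsRunning)
}

// newRaceSCM builds a manager in which datadog-trace-agent is started
// once, right before the first attempt to stop datadogagent.
func newRaceSCM() *fakeSCM {
	m := &fakeSCM{
		mainService: "datadogagent",
		dependents:  []string{"datadog-trace-agent"},
		running:     map[string]bool{"datadogagent": true},
	}
	fired := false
	m.beforeStopService = func(name string) {
		if name == m.mainService && !fired {
			fired = true
			m.running["datadog-trace-agent"] = true
		}
	}
	return m
}

func main() {
	passes, err := stopWithDependents(newRaceSCM(), 5)
	fmt.Println("passes:", passes, "err:", err)
}
```

In this model a single pass is not enough, because the dependent service appears after the dependents were enumerated; the second pass stops it and the main service stop then goes through.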
In addition to the race condition handled by the part 1 PR (#25282), this PR handles the following sequence:
1. The datadogagent service is starting at the end of Agent installation (no Agent services are running at this moment).
2. The agent.exe restart-service CLI command is started.
3. The agent.exe restart-service CLI command gets the list of datadogagent dependent services to be stopped before stopping the datadogagent service itself.
4. The agent.exe restart-service CLI command enumerates all dependent services. Some of them are running and some are not running, or not running yet (depending on configuration). All running dependent services are stopped.
5. The datadog-trace-agent service (or another dependent service) is started by the still-starting datadogagent service (from step 1).
6. The agent.exe restart-service CLI command "presumes" that all dependent services are stopped and finally calls stop for the datadogagent service itself. The call fails because, just before it, [Process A] (the still-starting datadogagent service from step 1) started the datadog-trace-agent service.
Motivation
Eliminate the race condition.
Additional Notes
Possible Drawbacks / Trade-offs
This PR fixes the race condition for the Agent restart-service CLI command. It does not address the fundamental, inherent race condition that affects unsuspecting mechanisms dealing strictly with Windows services, such as the PowerShell Restart-Service cmdlet.
Describe how to test/QA your changes
No manual QA is needed because all edge-case tests are automated in this PR.