Skip to content

WINA-747 Complete solution to solve startup/shutdown race condition#25453

Merged
iglendd merged 3 commits intomainfrom
len.gamburg/improve-agent-restart-service-part2
May 8, 2024
Merged

WINA-747 Complete solution to solve startup/shutdown race condition#25453
iglendd merged 3 commits intomainfrom
len.gamburg/improve-agent-restart-service-part2

Conversation

@iglendd
Copy link
Copy Markdown
Contributor

@iglendd iglendd commented May 8, 2024

What does this PR do?

It is the second and final part of the solution based on the first part partial solution (#25282). This PR is based on the same testing framework and its major contribution is detecting various parts of the race condition mostly based on the fact that ControlService with SERVICE_CONTROL_STOP code would fail if any dependent service is running. Accordingly, second or more iterations to stop dependent services will be required, unless total timeout is exceeded.

This will make Agent restart-service robust enough to handle race conditions when the Agent is just starting, but starting concurrently to the Agent restarting via Agent CLI command. This will not help Powershell restart-service cmdlet which would have non insignificant chances to fail if by the time core agent service is about to stop one of its dependent services is concurrently started just before attempting to stop main service.

In addition, a modest refactoring had been done to remove duplicate calls and reuse more recent Go runtime svc.ListDependentServices() function instead of handcrafted version.

In addition to the race condition handled by part #1 PR (#25282)

  1. [Process A] datadogagent service is starting in the end of Agent installation (no Agent services are running at this moment)
  2. [Process B] Externally and concurrently (after Agent configuration change) agent.exe restart-service CLI command is started.
  3. [Process B] agent.exe restart-service CLI command will get list of datadogagent dependent services to be stopped before stopping datadogagent service itself.
  4. [Process B] agent.exe restart-service CLI command enumerate all dependent services. Some of them running and some of them are not running or not running yet (depending on configuration). All running dependent services will be stopped.
  5. [Process A] datadog-trace-agent service (or other dependent services) is started by the still starting datadogagent service (from step no.1).
  6. [Process B] agent.exe restart-service CLI command "presumes" that all dependent services are stopped and will finally call stop for datadogagent service itself. It will fail since just before this call [Process A] started datadog-trace-agent service.
  7. [PR key fix] [Process B] will recognize error condition that a dependent service or services are still running, expect that and will iterate one or more time attempting to stop dependents services until all of them are finally be stopped.

Motivation

Eliminate race condition

Additional Notes

Possible Drawbacks / Trade-offs

This PR fixes race condition for Agent restart-service CLI command. It does not address fundamental and inherited race condition to unsuspected mechanism dealing strictly with Windows services, such as Powershell restart-service cmdlet.

Describe how to test/QA your changes

No needs to manually QA because all edge case tests are automated in this PR

@iglendd iglendd requested review from derekwbrown and richl-dd May 8, 2024 00:02
@iglendd iglendd added this to the 7.55.0 milestone May 8, 2024
@iglendd iglendd marked this pull request as ready for review May 8, 2024 00:26
@iglendd iglendd requested review from a team as code owners May 8, 2024 00:27
It is the second and final part of the solution based on the first part
partial solution (#25282).
This PR is based on the same testing framework and its major
contribution is detecting various parts of the race condition mostly
based on the fact that ControlService with SERVICE_CONTROL_STOP code
would fail if any dependent service is running. Accordingly, second
or more iterations to stop dependent services will be required, unless
total timeout is exceeded.

This will make Agent restart-service robust enough to handle race
conditions when the Agent is just starting, but starting concurrently
to the Agent restarting via Agent CLI command. This will not help
Powershell restart-service cmdlet which would have non insignificant
chances to fail if by the time core agent service is about to stop
one of its dependent services is concurrently started just before
attempting to stop main service.

In addition, a modest refactoring had been done to remove duplicate
calls and reuse more recent Go runtime svc.ListDependentServices()
function instead of handcrafted version.
@iglendd iglendd force-pushed the len.gamburg/improve-agent-restart-service-part2 branch from eb9f950 to 8b351f4 Compare May 8, 2024 00:31
@iglendd iglendd changed the title WINA-747 Complete solution to solve startup/shutdown race con…dition WINA-747 Complete solution to solve startup/shutdown race condition May 8, 2024
Copy link
Copy Markdown
Contributor

@buraizu buraizu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, approving with a minor rephrasing suggested for the release notes

Comment thread releasenotes/notes/fix-agent-start-restart-concurrency-c32580454545176b.yaml Outdated
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 8, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=33882450 --os-family=ubuntu

@agent-platform-auto-pr
Copy link
Copy Markdown
Contributor

agent-platform-auto-pr Bot commented May 8, 2024

[Fast Unit Tests Report]

On pipeline 33882450 (CI Visibility). The following jobs did not run any unit tests:

Jobs:
  • tests_deb-arm64-py3
  • tests_deb-x64-py3
  • tests_flavor_dogstatsd_deb-x64
  • tests_flavor_heroku_deb-x64
  • tests_flavor_iot_deb-x64
  • tests_rpm-arm64-py3
  • tests_rpm-x64-py3

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed reach out to #agent-developer-experience

…54545176b.yaml

Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>
@iglendd iglendd merged commit 225a640 into main May 8, 2024
@iglendd iglendd deleted the len.gamburg/improve-agent-restart-service-part2 branch May 8, 2024 18:33
status, err = service.Control(command)
if err != nil {
return fmt.Errorf("could not send control %d: %w", command, err)
return fmt.Errorf("could not send control %d to service %s: %w", command, serviceName, err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ question
Would it make sense to retry on error? Or errors here are not recoverable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends, this is low-level function. Retry exists in the callers

if callback != nil {
callback.beforeStopService(serviceName)
}
for {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ question
have you considered using out of the box retry libraries like backoff.Retry? Wonder if we should sleep between each retry to reduce CPU impact

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so in this case. First it is capped by timeout, currently 30 seconds - it will be short and anyway and second ability for restart to be "tight" is relatively important, Second, doStopService I think waits 300 ms on each individual service stop, external, meaning this loop is to "detect" race condition which happens and immediately try to address it.

@julien-lebot julien-lebot added the qa/done QA done before merge and regressions are covered by tests label Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[deprecated] team/windows-agent qa/done QA done before merge and regressions are covered by tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants