Skip to content

Datadog: fix names in Span::spawnChild#31366

Merged
ggreenway merged 3 commits into
envoyproxy:mainfrom
DataDog:david.goffredo/datadog-fix-names-in-spawn-child
Jan 2, 2024
Merged

Datadog: fix names in Span::spawnChild#31366
ggreenway merged 3 commits into
envoyproxy:mainfrom
DataDog:david.goffredo/datadog-fix-names-in-spawn-child

Conversation

@dgoffredo
Copy link
Copy Markdown
Contributor

The Bug and Its Fix

This is an additional fix for the bug described in #29932.

There are still situations in which the spans produced by the Datadog tracer have the wrong name (operation name). We want all Datadog spans to be named "envoy.proxy", but sometimes they have another value which should instead be the span's resource name.

#29932 addressed spans created by class Tracer, #30503 (which was about something else) addressed name changes due to Span::setOperation, and this pull request that you are reading addresses spans created by Span::spawnChild.

For a little background, there was a substantial rewrite of the Datadog tracing extension within Envoy in October 2022. Since then, the modified code has slowly worked its way into Envoy releases, and then into Istio releases. One difference between the old code and the new is that the old code was OpenTracing-based, while the new code is not. While making the changes, I failed to notice that the "operation name" that Envoy uses is not the same notion as that used in Datadog. This pull request and those linked above are my identifying all of the places where that confusion of terms appears.

This pull request changes the last inappropriate setting of a Datadog span's operation name, and adds dedicated unit tests to centralize and clarify the expected behavior.

Please tag this pull request for backport review. It fixes a bug that affects the v1.27 and v1.28 release branches. I'm happy to propose those backports once this pull request is reviewed.

Testing

See the new naming_test.cc and the modified span_test.cc.

Other Info

Commit Message: "Datadog: fix span name in Span::spawnChild"
Additional Description: see above
Risk Level: low
Testing: see included unit tests
Docs Changes: N/A
Release Notes: see included changes to changelogs/current.yaml
Platform Specific Features: N/A

Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Signed-off-by: David Goffredo <david.goffredo@datadoghq.com>
Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by comment:
This seems to be a user-facing change. Please add a runtime guard around this to allow users to migrate this over time.

@@ -0,0 +1,223 @@
/**
* The tests in this file aren't specific to a class, but instead test all
* behavior related to spans' "operation name" (a.k.a. "span name"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if this is part of span, consider adding to the span_test.cc file (there are a few test files with multiple "Test" classes). This will allow quicker discovery of span relevant tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test involves both Tracer and Span. I could justify adding it to either of those drivers, but I prefer to leave it separate in order to emphasize the particular behavior under test.

tracer_test.cc and span_test.cc together cover the code paths under test here. I've added this new driver to ensure that the particular overall behavior of "operation name, "resource name," and "service name" is maintained.

@dgoffredo
Copy link
Copy Markdown
Contributor Author

Thanks for taking a look, @adisuissa.

This seems to be a user-facing change. Please add a runtime guard around this to allow users to migrate this over time.

This is a fix for a bug. I'd prefer that users automatically got the fix when upgrading, as opposed to having to enable some option to get the behavior they had before the bug was introduced.

@phlax
Copy link
Copy Markdown
Member

phlax commented Dec 18, 2023

/backport

@repokitteh-read-only repokitteh-read-only Bot added the backport/review Request to backport to stable releases label Dec 18, 2023
@adisuissa
Copy link
Copy Markdown
Contributor

Thanks for taking a look, @adisuissa.

This seems to be a user-facing change. Please add a runtime guard around this to allow users to migrate this over time.

This is a fix for a bug. I'd prefer that users automatically got the fix when upgrading, as opposed to having to enable some option to get the behavior they had before the bug was introduced.

Yeah, having it runtime guarded doesn't imply it will be false-by-default. It should be true-by-default, but allows Envoy users to disable it (hopefully for a short duration) while making any changes needed in the rest of their system.

@cgilmour
Copy link
Copy Markdown
Contributor

IMO, this is just a bug-fix and we don't have users relying on or desiring the behavior without the fix. The other changes that were referenced in the PR description had a more-direct impact on users, and this PR is essentially just a final cleanup of previous fixes.
A runtime guard for this isn't a great fit in the circumstance.

But it's good to be reminded about that, as we have another change being developed where a runtime guard (enabled by default, but has an option to disable) would be a perfect fit!

@njegosrailic
Copy link
Copy Markdown

Hello everyone,

developers at Wayfair actively use DataDog tracing for Istio-enabled workloads, and this functionality is currently partially broken on Istio v1.17 and newer. I understand it's the holiday season, but we would be grateful if you could help merge this change.

@adisuissa
Copy link
Copy Markdown
Contributor

This is a fix for a bug. I'd prefer that users automatically got the fix when upgrading, as opposed to having to enable some option to get the behavior they had before the bug was introduced.

The runtime guard can (and should) be enabled by default. It just allows safer migration across a fleet if one detects a problem with the new code until it is completely rolled out.
That said, I'll leave it up to the owners of this extension to decide.

Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this!
Seems that the extension owners don't think this needs to be runtime guarded.

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/senior-maintainers assignee is @ggreenway

🐱

Caused by: a #31366 (review) was submitted by @adisuissa.

see: more, trace.

@ggreenway ggreenway merged commit a03fa34 into envoyproxy:main Jan 2, 2024
@dgoffredo
Copy link
Copy Markdown
Contributor Author

Thanks again, @adisuissa. My teammates and I will look into runtime guards for future changes, knowing now that it's an option.

@phlax phlax removed the backport/review Request to backport to stable releases label Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants