Skip to content

discovery: enable by default on agents >= 7.78#2915

Merged
tbavelier merged 7 commits intomainfrom
guillaume.pagnoux/service-discovery-default-enable
Apr 30, 2026
Merged

discovery: enable by default on agents >= 7.78#2915
tbavelier merged 7 commits intomainfrom
guillaume.pagnoux/service-discovery-default-enable

Conversation

@Yumasi
Copy link
Copy Markdown
Member

@Yumasi Yumasi commented Apr 20, 2026

What does this PR do?

This PR enables Service Discovery by default on agents >=7.78, which will use the new System-Probe-Lite binary for low-resource overhead. The version is determined based on the image tag, and an unparseable tag is considered to be the latest version.

The change makes the distinction between default enablement, and an explicit one. This allows users of an agent <7.78 to still enable discovery if they truly wants it, via system-probe (much higher overhead).

Motivation

Enable Service Discovery by default.

Additional Notes

This PR uses the fallback mechanism of System-Probe to decide to start System-Probe-Lite. This is done in order to avoid a drift between the operator & the agent in the handling of the fallback decision (i.e we would have to maintain a list of system-probe feature, as proposed #2610, which could be easily missed when adding new features. Missing it could cause wrong behaviors like not enabling system-probe when we actually should for another feature, for example).

Using system-probe to control the fallback decision, like in bare-Linux & Docker installations, avoid this sort of issues, but this means that we will see a spike in memory consumption due to system-probe's size itself (~70MiB). Once the fallback to System-Probe-Lite happens, the memory use is expected to fallback to ~3.4MiB (measured in dogfooding).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: v7.78.0

Describe your test plan

Fallback behaviour + system-probe container creation tested with Kind and the following cases:

Agent version Config scenario system-probe
container status
Running binary
7.78 Default configuration Present System-Probe-Lite
7.77 Default configuration Absent N/A
7.78 Discovery explicitly disabled Absent N/A
7.77 Discovery explicitly enabled Present System-Probe
7.78 Other system-probe feature enabled Present System-Probe

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@Yumasi Yumasi added this to the v1.27.0 milestone Apr 20, 2026
@Yumasi Yumasi requested a review from a team April 20, 2026 14:31
@Yumasi Yumasi added the enhancement New feature or request label Apr 20, 2026
@Yumasi Yumasi requested a review from a team as a code owner April 20, 2026 14:31
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ceec7f0ad

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/controller/datadogagent/defaults/datadogagent_default.go Outdated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.41%. Comparing base (099f33f) to head (f7e57b4).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2915      +/-   ##
==========================================
+ Coverage   41.38%   41.41%   +0.02%     
==========================================
  Files         327      327              
  Lines       28952    28971      +19     
==========================================
+ Hits        11982    11997      +15     
- Misses      16109    16112       +3     
- Partials      861      862       +1     
Flag Coverage Δ
unittests 41.41% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ller/datadogagent/defaults/datadogagent_default.go 87.03% <ø> (-0.05%) ⬇️
...r/datadogagent/feature/servicediscovery/feature.go 85.29% <100.00%> (+6.12%) ⬆️

... and 2 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 099f33f...f7e57b4. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@datadog-official
Copy link
Copy Markdown

datadog-official Bot commented Apr 20, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 41.54% (+0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f7e57b4 | Docs | Datadog PR Page | Give us feedback!

Copy link
Copy Markdown
Member

@tbavelier tbavelier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we don't touch the defaults directory like this, and we move this logic instead to the feature itself in Configure:

func (f *serviceDiscoveryFeature) Configure(_ metav1.Object, ddaSpec *v2alpha1.DatadogAgentSpec, _ *v2alpha1.RemoteConfigConfiguration) (reqComp feature.RequiredComponents) {
. Would that be possible ? (I didn't do an in-depth review / tested if it would work, happy to be proven wrong of course)

@Yumasi Yumasi requested a review from a team as a code owner April 20, 2026 15:33
@Yumasi Yumasi force-pushed the guillaume.pagnoux/service-discovery-default-enable branch from f8d22aa to a6d4a92 Compare April 20, 2026 16:07
@Yumasi Yumasi force-pushed the guillaume.pagnoux/service-discovery-default-enable branch from a6d4a92 to 3ff8d0b Compare April 20, 2026 16:28
@Yumasi
Copy link
Copy Markdown
Member Author

Yumasi commented Apr 20, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ff8d0b87f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +82 to +83
if version := common.GetAgentVersionFromImage(*nodeAgent.Image); version != "" {
return version
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the inherited tag for name-only image overrides

When override.nodeAgent.image.name is set to a bare image name such as agent and tag is omitted, the pod override path preserves the inherited/default tag, but this branch gets latest from common.GetAgentVersionFromImage and never falls back to images.AgentLatestVersion. Since shouldEnableServiceDiscoveryByDefault treats unparseable versions as supported, clusters still running the current default Agent tag below 7.78 will unexpectedly get system-probe and DD_DISCOVERY_ENABLED just because they used the documented name-only image override; resolve the version from the actual assembled/overridden image or ignore the synthetic latest tag here.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

@Yumasi Yumasi Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want service discovery defaulting to follow the effective node-agent image version, not only explicitly pinned tags. So if no version is provided, including partial image overrides, we intentionally inherit the default agent version. Unparseable provided tags are also treated as latest by design.
Added a comment + tests to ensure this is the intended behaviour.

@Yumasi Yumasi force-pushed the guillaume.pagnoux/service-discovery-default-enable branch from a45a54f to ebe9bb9 Compare April 22, 2026 08:42
@Yumasi Yumasi requested a review from tbavelier April 24, 2026 13:02
Copy link
Copy Markdown
Member

@tbavelier tbavelier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the operator side provided the fallback to sp-lite works as expected

@tbavelier tbavelier dismissed their stale review April 29, 2026 13:28

Removing to unblock, changes are LGTM from the operator side, but fallback is seemingly not working as expected on Agent side from system-probe to system-probe-lite

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7e57b4281

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Agent version must be >= 7.78.0 to enable service discovery by default.
if nodeAgent, ok := ddaSpec.Override[v2alpha1.NodeAgentComponentName]; ok {
if nodeAgent != nil && nodeAgent.Image != nil {
return pkgutils.IsAboveMinVersion(common.GetAgentVersionFromImage(*nodeAgent.Image), serviceDiscoveryAutoEnableMinVersion, ptr.To(true))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Resolve service discovery version from rendered image

When nodeAgent.image.name is a full image reference with a tag, the image rendering path ignores nodeAgent.image.tag, but this check calls GetAgentVersionFromImage, which gives Tag priority. For a valid spec such as name: gcr.io/datadoghq/agent:7.77.2 plus tag: 7.78.0, the pod still runs 7.77.2 while the new defaulting enables Service Discovery and DD_DISCOVERY_USE_SYSTEM_PROBE_LITE as if the Agent were supported; the inverse combination disables it on a supported rendered image. Use the same assembled/overridden image reference that will be put on the pod when deriving the version.

Useful? React with 👍 / 👎.

@tbavelier tbavelier merged commit 6be3f12 into main Apr 30, 2026
38 checks passed
@tbavelier tbavelier deleted the guillaume.pagnoux/service-discovery-default-enable branch April 30, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants