feat(config): bring dd-config validation checks natively into the agent#47707
Conversation
…tal check-config/onboard
Ports missing validation features from dd-config into two new hidden,
undocumented subcommands under `agent experimental`. Neither command
appears in `agent --help` or `agent experimental --help`.
agent experimental check-config -c <dir> [--no-api]
agent experimental onboard -c <dir> [--no-api]
Both run a 6-stage validation pipeline on datadog.yaml without requiring
a running agent:
1. File permission warnings (world-readable detection)
2. YAML syntax validation with tab detection and friendly error messages
3. API key format check — tolerates empty (cloud-based auth) and ENC[]
(secret backend); uses pkg/util/scrubber for display masking
4. Site/region validation via regexp (same pattern as
pkg/config/utils/endpoints.go) — forward-compatible with new DCs
5. Live API key validation via HTTP using BuildURLWithPrefix; skippable
with --no-api
6. Product enablement summary (APM, Logs, Live Process, CSPM, CWS)
noAPICheck is scoped to experimentalParams only and does not appear in
the shared cliParams struct used by other agent config subcommands.
Commands are hidden pending a feature flag decision. Code is fully isolated:
check.go and experimental.go are new files with no coupling to existing
runtime config commands.
New dependency: pkg/util/scrubber and pkg/config/utils (both already in
go.mod via replace directives; no net-new external deps).
Schema validation (Phase 2d) excluded — jsonschema/v6 would add ~2MB to
the binary; deferred to a future decision.
15 unit tests, 0 linter issues.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Kunal Karandikar <first.last@datadoghq.com>
Signed-off-by: Kunal Karandikar <kunal.karandikar@datadoghq.com>
|
All contributors have signed the CLA ✍️ ✅ |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8c723140c1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| site, siteValid := checkSite(cfg) | ||
|
|
||
| // 5. Live API key validation (skip if --no-api, or if key/site are invalid) | ||
| if !cliParams.noAPICheck && !hasErrors && siteValid { |
There was a problem hiding this comment.
Treat unknown site as a failing config check
When checkSite reports an unknown site, runConfigCheck does not mark the run as failed; siteValid is only used to skip the live API call, so hasErrors can remain false and the command exits 0 even after printing [ERR] site. This means a config whose only problem is an invalid region value can incorrectly pass CI/automation that relies on the exit code.
Useful? React with 👍 / 👎.
|
|
||
| // runConfigSetup launches the interactive TUI wizard for editing datadog.yaml. | ||
| func runConfigSetup(_ log.Component, cfg config.Component, cliParams *cliParams) error { | ||
| configPath := cfg.ConfigFileUsed() |
There was a problem hiding this comment.
Resolve writable config path when no file was loaded
runConfigSetup uses cfg.ConfigFileUsed() as the write target, but that value is empty when no datadog.yaml was found (for example on first-time setup or when -c points to a directory without the file). In that case applySetupResult eventually calls os.WriteFile with an empty path and the wizard cannot create a new config file, which breaks the advertised "create or edit" flow.
Useful? React with 👍 / 👎.
Go Package Import DifferencesBaseline: cc12a65
|
Files inventory check summaryFile checks results against ancestor cc12a65a: Results for datadog-agent_7.79.0~devel.git.72.800b673.pipeline.104311618-1_amd64.deb:No change detected |
|
Hi @kunalkxxxxxxxxr, if you run |
8c72314 to
5966345
Compare
|
Code isolation check
|
5966345 to
4b488b5
Compare
4b488b5 to
5aad7f0
Compare
5aad7f0 to
9e22639
Compare
825d463 to
c3b08a1
Compare
|
I have read the CLA Document and I hereby sign the CLA |
c3b08a1 to
ba5f19f
Compare
ba5f19f to
3e044d1
Compare
547b5c8 to
3be42fd
Compare
|
|
||
| // checkFilePermissions warns if the config file is world-readable. | ||
| // Returns a warning string (empty if permissions look good). | ||
| // hush-hush: Removed sudo suggestion. Only call this on non-Windows systems; |
There was a problem hiding this comment.
💬 suggestion
| // hush-hush: Removed sudo suggestion. Only call this on non-Windows systems; |
Looks like a dev comment, should it be removed?
| func checkFilePermissions(path string) string { | ||
| info, err := os.Stat(path) | ||
| if err != nil { | ||
| return "" | ||
| } | ||
| if info.Mode()&0o007 != 0 { | ||
| return fmt.Sprintf("config file is world-readable (mode %s) — API key may be exposed. Fix: chmod 640 %s", info.Mode(), path) | ||
| } | ||
| return "" | ||
| } |
There was a problem hiding this comment.
💬 suggestion
Instead of returning a string, would be more Go-like and readable to return a pair of boolean, error. If the check is ok, it returns true, nil. Otherwise it returns false and give context in the error. This way you can stack errors
| func checkFilePermissions(path string) string { | |
| info, err := os.Stat(path) | |
| if err != nil { | |
| return "" | |
| } | |
| if info.Mode()&0o007 != 0 { | |
| return fmt.Sprintf("config file is world-readable (mode %s) — API key may be exposed. Fix: chmod 640 %s", info.Mode(), path) | |
| } | |
| return "" | |
| } | |
| func checkFilePermissions(path string) (bool, error) { | |
| // this should not run on windows, raise an error | |
| if runtime.GOOS == "windows" { | |
| return false, errors.New("File permissions check not implemented on Windows") | |
| } | |
| info, err := os.Stat(path) | |
| if err != nil { | |
| return false, err | |
| } | |
| if info.Mode()&0o007 != 0 { | |
| return false, fmt.Errorf("config file is world-readable (mode %s) — API key may be exposed. Fix: chmod 640 %s", info.Mode(), path) | |
| } | |
| return true, nil | |
| } |
| // line, and a plain-English description of the problem. | ||
| func buildFriendlyYAMLError(yamlMsg string, lines []string) error { | ||
| var lineNum int | ||
| fmt.Sscanf(yamlMsg, "yaml: line %d:", &lineNum) //nolint:errcheck |
There was a problem hiding this comment.
💬 suggestion
Claude and AI agents tend to find the shortest path to validation. In this case, instead of adding an error check as suggested by the linter, it mutes it. Let's implement it.
| fmt.Sscanf(yamlMsg, "yaml: line %d:", &lineNum) //nolint:errcheck | |
| _, err := fmt.Sscanf(yamlMsg, "yaml: line %d:", &lineNum) | |
| if err != nil { | |
| return err | |
| } |
There was a problem hiding this comment.
💬 suggestion
You are not using lineNum until row 73, let's move it there
| // buildFriendlyYAMLError converts a raw yaml.v3 error into a human-readable | ||
| // message that includes the line number, the actual content of the offending | ||
| // line, and a plain-English description of the problem. | ||
| func buildFriendlyYAMLError(yamlMsg string, lines []string) error { |
There was a problem hiding this comment.
💬 suggestion
Better returning a string, rather than an error. You don't stack errors here, and you always return something, never a nil error
| case strings.Contains(yamlMsg, "found character that cannot start any token"): | ||
| description = "tab character used for indentation" | ||
| fix = "YAML requires spaces for indentation, not tabs — replace all tabs with spaces" | ||
| case strings.Contains(yamlMsg, "mapping values are not allowed"): | ||
| description = "incorrect indentation or missing space after colon" | ||
| fix = "check for missing spaces after colons or incorrect nesting" | ||
| case strings.Contains(yamlMsg, "did not find expected key"): | ||
| description = "unexpected indentation level" | ||
| fix = "a nested key may be at the wrong indentation level" | ||
| default: | ||
| description = yamlMsg | ||
| fix = "check indentation (use spaces, not tabs) and colon placement" |
There was a problem hiding this comment.
🔨 warning
If the yamlMsg contains more than one of these, you will overwrite them. A better implementation would be to rather loop on possible messages and stack them in an array. Then build the message based on all the errors found.
Also a bit worried about this hardcoded implementation, as it strongly depends on the yaml output. One guess have is that, if the code runs on an OS whom language is set to FR or anything else than EN, this will not work.
| func validateAPIKey(apiKey string) error { | ||
| if apiKey == "" { | ||
| // Empty may be valid with cloud-based authentication — warn, don't error. | ||
| fmt.Printf("[WARN] api_key: not set — may be valid with cloud-based auth, but required for most configurations\n") |
There was a problem hiding this comment.
🔨 warning
This string should not go straight to output. It should be returned as output, and handled by the caller printed through logs
| fmt.Printf("[OK] site: '%s' is a valid Datadog site\n", site) | ||
| return site, true | ||
| } | ||
| fmt.Printf("[ERR] site: '%s' does not appear to be a valid Datadog site\n", site) |
There was a problem hiding this comment.
🔨 warning
Same here, you cannot mute these logs, not ideal.
| // hush-hush: Use cfg.GetBool() directly rather than traversing the raw YAML map. | ||
| // hush-hush: Simplified product summary: show the "no products" notice then use | ||
| // a single loop for both cases, displaying [X] for disabled in both scenarios. | ||
| func checkEnabledProducts(cfg config.Component) { |
There was a problem hiding this comment.
💬 suggestion
Here too return bool + error
| for _, name := range []string{"check-config", "onboard"} { | ||
| sub := &cobra.Command{Use: name, Hidden: true, RunE: runE} | ||
| sub.Flags().BoolVar(&ep.noAPICheck, "no-api", false, "") | ||
| experimentalCmd.AddCommand(sub) | ||
| } |
There was a problem hiding this comment.
❓ question
Why onboard and check-config implement the same command runE? Why not just one command?
3be42fd to
671795f
Compare
| } | ||
| if len(descriptions) == 0 { | ||
| descriptions = []string{yamlMsg} | ||
| fixes = []string{"check indentation (use spaces, not tabs) and colon placement"} |
There was a problem hiding this comment.
❓ question
if descriptions is empty, it means it did not find
{
contains: "found character that cannot start any token",
description: "tab character used for indentation",
fix: "YAML requires spaces for indentation, not tabs — replace all tabs with spaces",
},
why is this fix suggesting to check indentation anyway?
There was a problem hiding this comment.
The fallback fix message will no longer incorrectly suggest checking indentation — it now says "refer to the YAML error above for details" since the actual cause is unknown.
| } | ||
| if !apiKeyRegex.MatchString(apiKey) { | ||
| err := fmt.Errorf("api_key format is invalid (got %d chars, expected 32 hex characters)", len(apiKey)) | ||
| return err.Error(), err |
There was a problem hiding this comment.
💬 suggestion
No need to return err.Error() as string
| return err.Error(), err | |
| return "", err |
| // checkSite validates the configured site value. Returns the site string | ||
| // (defaulting to datadoghq.com if unset) and whether the site appears valid. | ||
| // Returns (site, valid, message) — message is for the caller to log. | ||
| func checkSite(cfg config.Component) (site string, valid bool, message string) { |
There was a problem hiding this comment.
💬 suggestion
For coherence with other checks
| func checkSite(cfg config.Component) (site string, valid bool, message string) { | |
| func checkSite(cfg config.Component) (valid bool, site string, message string) { |
| // Empty keys and ENC[] keys are not errors. | ||
| // Uses HideKeyExceptLastFiveChars from the agent's scrubber package | ||
| // for consistent API key masking. | ||
| func validateAPIKey(apiKey string) (string, error) { |
There was a problem hiding this comment.
💬 suggestion
To help interacting with this function, better explicit what the string returns. Would also add a boolean for coherence with other checks
| func validateAPIKey(apiKey string) (string, error) { | |
| func validateAPIKey(apiKey string) (message string, err error) { |
671795f to
0b4b611
Compare
|
|
||
| dir := t.TempDir() | ||
| path := filepath.Join(dir, "datadog.yaml") | ||
| require.NoError(t, os.WriteFile(path, []byte("api_key: test\n"), 0640)) |
There was a problem hiding this comment.
💬 suggestion
To properly clean up after test, I would remove the file in a defered call
| require.NoError(t, os.WriteFile(path, []byte("api_key: test\n"), 0640)) | |
| require.NoError(t, os.WriteFile(path, []byte("api_key: test\n"), 0640)) | |
| defer require.NoError(t, os.RemoveFile(path)) |
There was a problem hiding this comment.
Done — added the deferred cleanup. Two small corrections from your suggestion: os.RemoveFile doesn't exist in Go (it's os.Remove), and the closure form is needed so os.Remove runs at defer-time rather than immediately: defer func() { require.NoError(t, os.Remove(path)) }().
| lines := []string{ | ||
| "api_key: abcdef1234567890abcdef1234567890", | ||
| "site: datadoghq.com", | ||
| "apm_config:", | ||
| "\tenabled: true", | ||
| } |
There was a problem hiding this comment.
Good call — done. Created testdata/yaml_error_test.yaml (with a literal tab on line 4 for the tab-character test case) and switched to //go:embed + strings.Split(strings.TrimRight(...)) to load it. All four sub-tests still pass.
| assert.NotEmpty(t, p.name, "product name must not be empty") | ||
| } | ||
| }) | ||
| } |
There was a problem hiding this comment.
❓ question
What happens if a config file has multiple errors?
There was a problem hiding this comment.
Good question. runConfigCheck runs all validation stages and accumulates errors — it does not short-circuit after the first failure.
Stages 3 (API key format) and 4 (site) both set hasErrors = true and continue, so a config with e.g. both a bad API key and a bad site will surface both [ERR] lines before the function returns.
The one exception is a YAML parse failure at stage 2: since later stages depend on a valid, parsed config, we return early there with a clear error message.
I've added TestMultipleErrorsAllReported to cover this — it verifies each check produces an independent error and documents the early-return exception for YAML.
Why are we skipping this on Windows? if you're not sure how to implement then #windows-products can help :) |
ba5e79f to
bb9c647
Compare
da4894b to
800b673
Compare
What does this PR do?
Adds new experimental functionality to the Datadog Agent CLI that implements a 6-stage config validation pipeline for
datadog.yaml, without requiring a running agent. The entry point is:Validation stages run in order:
datadog.yamlis world-readable (API key exposure risk); skipped on Windows[ERR]formatENC[]secret backend notation; usesscrubber.HideKeyExceptLastFiveCharsfor display maskingcfg.GetBool()to show which products are enabled; when none are configured, clearly states "No products are enabled. Sending only Metrics."Motivation
agent configcurrently only supports runtime config inspection (IPC to a running agent).dd-config— a standalone Datadog CLI — has a richer validation pipeline that catches common misconfigurations before the agent starts. This PR ports the most impactful checks into the agent binary itself so users have a single tool.Schema validation (2200+ keys via
santhosh-tekuri/jsonschema/v6) was prototyped but excluded due to the ~2MB binary size increase. The charmbracelet TUI wizard was also prototyped but excluded — it requires bubbletea + lipgloss and is better suited to a standalone tool likedd-config.Describe how you validated your changes
Unit tests (8 test functions, 25 cases — all passing):
TestCheckYAMLSyntax— 8 sub-cases: valid YAML, tabs, missing colons, bad indentationTestBuildFriendlyYAMLError— 4 sub-cases: line number extraction, human-readable messagesTestCheckFilePermissions— world-readable warning, nosudoin message; skipped on WindowsTestValidSiteRe— regexp matches all known Datadog sites, rejects unknown domainsTestValidateAPIKey— 5 sub-cases: valid key, empty key,ENC[], too short, non-hexTestFormatLineNumbers— line number formatting helperfxutil.TestOneShotSubcommandManual phase testing using fixture files:
[ERR] yaml_syntax: YAML syntax error on line 4: tab character... content: "\tenabled: true"[OK][ERR] api_key: format is invalid (got 8 chars...)[OK] api_key[ERR] site: does not appear to be a valid Datadog site[OK] site[WARN] permissions: world-readable[OK] permissionsNo products are enabled. Sending only Metrics.+[X]for all✓ Log collection ✓ APM ✓ Live ProcessAdditional Notes
showRuntimeConfiguration,listRuntimeConfigurableValue,setConfigValue,getConfigValue,otelAgentCfg) — zero couplingcfg.GetBool()directly per reviewer feedbackcmd.Root().SetErr(io.Discard)prevents the error from being printed twice byruncmd.displayErrorscrubber.HideKeyExceptLastFiveChars(existing agent utility)pkg/config/utils/endpoints.go(ddDomainPattern) for forward-compatibility with future Datadog datacenters