feat(telemetry): integrate Datadog Installer lightweight tracer#186
Merged
feat(telemetry): integrate Datadog Installer lightweight tracer#186
Conversation
astuyve
reviewed
Apr 16, 2026
Wraps Runner.Run and command dispatch in the rshell interpreter with
spans emitted via the Datadog Installer's lightweight tracer
(github.com/DataDog/datadog-agent/pkg/fleet/installer/telemetry), so
embedders like PAR can observe restricted-shell execution without
wiring a full tracer.
Span layout (operation → resource):
- run → run: the Runner.Run invocation envelope, tagged with rshell.version
- command → <command name>: each dispatched command, tagged with name,
argc, exit_code, is_allowed, is_known, has_stdin_pipe, has_output_redirect
- control_flow → if | for | for.iteration | pipeline: flattened grouping
spans for the corresponding language constructs, with structural tags
(branch_count/branch_taken, iteration_count/broke_early/variable_name,
stage_count/exit_code)
The if/elif/else handling was refactored from recursion-through-cmd() into
an iterative execIfChain walk so one span covers the whole chain. Pipelines
use an inPipeline flag on runnerState to suppress nested pipeline spans
from the recursive N-stage pipe construction.
Standalone binary wiring is gated behind the with_telemetry build tag
(cmd/rshell/telemetry_{on,off}.go). In that build, DD_API_KEY and DD_SITE
drive a Telemetry sender constructed with http.DefaultClient (proxy-aware).
The default release binary ships the no-op variant — spans register on the
global tracer but never flush, which is a bounded leak for a short-lived
rshell invocation.
Coverage: Go tests for the span payload (tags, nesting, parent-child
linkage) plus five new byte-for-byte bash-compared scenarios stressing the
iterative execIfChain walk (multi-command condition, side-effect
short-circuit, 8-way elif chain, assignment in condition, exit mid-chain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9312ae5 to
002ecf8
Compare
Adds two structured tags to the run span so operators can slice rshell
invocations without parsing error.message strings:
rshell.run.exit_code (int) — the interpreter's final exit code.
rshell.run.outcome (string enum) — how the run ended:
success any script completion (zero or non-zero exit, incl. explicit exit N)
unallowed the last command dispatched was blocked by AllowedCommands
unknown the last command dispatched was not in the builtin registry
timeout context.DeadlineExceeded (rshell --timeout or caller)
canceled context.Canceled (caller aborted the run)
output_limit stdout cap (ErrOutputLimitExceeded) was hit
fatal anything else (pipe creation error, internal panic, …)
Only "abnormal" outcomes (timeout/canceled/output_limit/fatal) propagate
retErr into span.Finish, so APM error-rate dashboards stay clean —
plain non-zero exits and blocked commands no longer count as errors.
Operators who care about policy rejections alert on
rshell.run.outcome:unallowed instead.
"unallowed" / "unknown" are resolved via a new lastDispatchStatus field
on runnerState, updated inside call() and propagated up from subshells
when a pipeline or (…) adopts a subshell's exit. Last-dispatch semantics
mean a blocked command early in a script does not pollute outcome if a
later command runs cleanly — matching the conventional shell reading of
the final command's status.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "unallowed" / "unknown" outcome categories with two per-run counters on the run span: rshell.run.unallowed_count — commands blocked by AllowedCommands rshell.run.unknown_count — commands missing from the builtin registry Rationale: rshell does not halt on a blocked or unknown command — it prints an error, sets the call's exit to 127, and keeps executing subsequent statements. The earlier "unallowed" / "unknown" outcome, driven by the last-dispatched command's status, missed rejections that happened mid-script when the last command ran cleanly, and could also mask a clean script whose only blip was an intentional policy block. Counters capture the run-wide totals: operators alert on unallowed_count > 0 for policy violations, or unknown_count > 0 for roadmap gaps, without false positives from non-zero exits. Outcome is back to the "shell-ran-fine" taxonomy (success / timeout / canceled / output_limit / fatal) — any script completion is success. Counts are accumulated per Runner and summed into the parent when a pipeline or (…) subshell adopts a child's exit, so pipeline stages on either side of the pipe are both reflected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits the per-run command counters into a richer, more accurately scoped set, namespaced under rshell.run.commands.* so the related tags group visually in the Datadog UI: rshell.run.commands.total every call() invocation, regardless of outcome rshell.run.commands.dispatched commands that ran through a builtin fn rshell.run.commands.unallowed commands blocked by AllowedCommands rshell.run.commands.unknown commands not in the builtin registry unallowed and unknown are now counted independently. Previously a command blocked by policy short-circuited before the registry check, so a name that was both blocked and missing (e.g. `pwd` under a PAR allowlist that doesn't include it) only counted as unallowed — hiding the roadmap signal. The new logic mirrors the per-command rshell.command.is_allowed and rshell.command.is_known tags: both are independent facts about the command name, so both counters increment for a command that matches both predicates. The total counter captures every attempted dispatch (= the number of rshell.command spans produced under the run), giving operators an answer to "how many commands did this script attempt?" that is stable regardless of how the outcomes break down. All four counters are summed up from pipeline stages and (…) subshells so the run span carries run-wide totals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
julesmcrt
added a commit
to DataDog/datadog-agent
that referenced
this pull request
Apr 20, 2026
## What does this PR do? Adds tracing in the Private Action Runner so each task dispatch emits a span on `service: private-action-runner`. The rshell bundle swaps the context service to `rshell` before invoking the interpreter, so rshell-emitted child spans land on `service: rshell`. ### Span layout ``` action.run (service: private-action-runner) resource: <action FQN> tag: task_id: <uuid> └── <rshell spans from DataDog/rshell#186> (service: rshell) ``` ### Changes - `WorkflowRunner.RunTask` wraps every dispatch in an `action.run` span. Resource = action FQN, `task_id` as a tag. - `rshell.runCommand` sets `service: rshell` on the context before invoking `interp.Runner.Run` so rshell's public callbacks (OnCommandDispatch, OnPipelineStart, …) attribute child spans correctly. - PAR component bootstraps a `fleettelemetry.Telemetry` flusher at startup using `api_key` + configured `site` so spans ship to the instrumentation telemetry intake. - New `pkg/privateactionrunner/observability/traces.go` holds the service and operation name constants. ## Motivation Give PAR and rshell runtime visibility in APM. Enables debugging blocked commands, rshell script performance, and cross-service latency breakdown without log-mining. ## Depends on - Tracer base changes: base of this PR (`jules.macret/Q/lightweight-tracer-improvements`) - rshell callbacks: DataDog/rshell#186 ## Describe how you validated your changes - Unit tests pass (`dda inv test --targets=./pkg/privateactionrunner/...`) - Deployed to a local Kind cluster; verified end-to-end in org2 (US1 prod) APM — `action.run` span with `service: private-action-runner` and rshell child spans with `service: rshell` ## Additional notes Blocked by the two dependencies above. Merge base first, then this PR. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See example trace (from a local run)
Summary
pkg/fleet/installer/telemetry), so embedders like PAR can observe restricted-shell runs without wiring a full tracer themselves.run→run: Runner.Run envelope, tagged withrshell.versioncommand→ : each dispatched command, withname,argc,exit_code,is_allowed,is_known,has_stdin_pipe,has_output_redirecttagscontrol_flow→if | for | for.iteration | pipeline: flattened grouping spans with structural tags (branch_count/branch_taken,iteration_count/broke_early/variable_name,stage_count/exit_code)Two telemetry paths
Why no arg sanitization on `rshell.command.name`
By the time `call()` runs, the shell parser (`mvdan.cc/sh/v3`) has already lexed, expanded, and quote-stripped the input; `args[0]` is a valid command word — a builtin name, a path like `./script.sh`, or a variable-expanded result in command position. Realistically it's never a secret (a customer executing a password as a command is not a failure mode worth engineering around). Metric cardinality is the only residual concern and is bounded in practice.
Test plan
🤖 Generated with Claude Code