feat(telemetry): integrate Datadog Installer lightweight tracer by julesmcrt · Pull Request #186 · DataDog/rshell

julesmcrt · 2026-04-16T14:17:12Z

Summary

Emit spans for rshell execution via the Datadog Installer's lightweight tracer (pkg/fleet/installer/telemetry), so embedders like PAR can observe restricted-shell runs without wiring a full tracer themselves.
Span layout (operation → resource):
- run → run: Runner.Run envelope, tagged with rshell.version
- command → : each dispatched command, with name, argc, exit_code, is_allowed, is_known, has_stdin_pipe, has_output_redirect tags
- control_flow → if | for | for.iteration | pipeline: flattened grouping spans with structural tags (branch_count/branch_taken, iteration_count/broke_early/variable_name, stage_count/exit_code)

Two telemetry paths

Embedded in datadog-agent (PAR — the path customers hit): rshell itself never constructs a `*Telemetry` and never opens an HTTP connection. PAR constructs the sender once at startup with its own proxy-aware `*http.Client`, and rshell's spans register on the installer tracer's package-global span registry — they flush through PAR's sender. No HTTP wiring ships in the rshell library.
Standalone rshell binary (local use only): ships with telemetry off by default. Build with `-tags with_telemetry` and set `DD_API_KEY` (+ optional `DD_SITE`) to flush to intake. In that build, `cmd/rshell/telemetry_on.go` constructs a `*Telemetry` using `http.DefaultClient` (so `HTTP_PROXY` / `HTTPS_PROXY` are honored) and flushes on exit. This path is strictly for running rshell directly from a terminal — it is never used when rshell is invoked through the datadog-agent.

Why no arg sanitization on `rshell.command.name`

By the time `call()` runs, the shell parser (`mvdan.cc/sh/v3`) has already lexed, expanded, and quote-stripped the input; `args[0]` is a valid command word — a builtin name, a path like `./script.sh`, or a variable-expanded result in command position. Realistically it's never a secret (a customer executing a password as a command is not a failure mode worth engineering around). Metric cardinality is the only residual concern and is bounded in practice.

Test plan

`go test ./...`
`RSHELL_BASH_TEST=1 go test ./tests/ -run TestShellScenariosAgainstBash` (byte-for-byte bash parity, including 5 new scenarios stressing the iterative `execIfChain` walk)
Local end-to-end (standalone path only — not exercised via PAR): `go build -tags with_telemetry -o /tmp/rshell ./cmd/rshell` → `DD_API_KEY="XXX" /tmp/rshell --allow-all-commands -c 'for x in apple banana cherry; do if [ "$x" = banana ]; then echo hi | cat | grep hi; elif [ "$x" = apple ]; then echo elif-branch; true; else echo else-branch; false; fi; done; false || true'` → traces visible under `service:rshell`

🤖 Generated with Claude Code

Wraps Runner.Run and command dispatch in the rshell interpreter with spans emitted via the Datadog Installer's lightweight tracer (github.com/DataDog/datadog-agent/pkg/fleet/installer/telemetry), so embedders like PAR can observe restricted-shell execution without wiring a full tracer. Span layout (operation → resource): - run → run: the Runner.Run invocation envelope, tagged with rshell.version - command → <command name>: each dispatched command, tagged with name, argc, exit_code, is_allowed, is_known, has_stdin_pipe, has_output_redirect - control_flow → if | for | for.iteration | pipeline: flattened grouping spans for the corresponding language constructs, with structural tags (branch_count/branch_taken, iteration_count/broke_early/variable_name, stage_count/exit_code) The if/elif/else handling was refactored from recursion-through-cmd() into an iterative execIfChain walk so one span covers the whole chain. Pipelines use an inPipeline flag on runnerState to suppress nested pipeline spans from the recursive N-stage pipe construction. Standalone binary wiring is gated behind the with_telemetry build tag (cmd/rshell/telemetry_{on,off}.go). In that build, DD_API_KEY and DD_SITE drive a Telemetry sender constructed with http.DefaultClient (proxy-aware). The default release binary ships the no-op variant — spans register on the global tracer but never flush, which is a bounded leak for a short-lived rshell invocation. Coverage: Go tests for the span payload (tags, nesting, parent-child linkage) plus five new byte-for-byte bash-compared scenarios stressing the iterative execIfChain walk (multi-command condition, side-effect short-circuit, 8-way elif chain, assignment in condition, exit mid-chain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two structured tags to the run span so operators can slice rshell invocations without parsing error.message strings: rshell.run.exit_code (int) — the interpreter's final exit code. rshell.run.outcome (string enum) — how the run ended: success any script completion (zero or non-zero exit, incl. explicit exit N) unallowed the last command dispatched was blocked by AllowedCommands unknown the last command dispatched was not in the builtin registry timeout context.DeadlineExceeded (rshell --timeout or caller) canceled context.Canceled (caller aborted the run) output_limit stdout cap (ErrOutputLimitExceeded) was hit fatal anything else (pipe creation error, internal panic, …) Only "abnormal" outcomes (timeout/canceled/output_limit/fatal) propagate retErr into span.Finish, so APM error-rate dashboards stay clean — plain non-zero exits and blocked commands no longer count as errors. Operators who care about policy rejections alert on rshell.run.outcome:unallowed instead. "unallowed" / "unknown" are resolved via a new lastDispatchStatus field on runnerState, updated inside call() and propagated up from subshells when a pipeline or (…) adopts a subshell's exit. Last-dispatch semantics mean a blocked command early in a script does not pollute outcome if a later command runs cleanly — matching the conventional shell reading of the final command's status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the "unallowed" / "unknown" outcome categories with two per-run counters on the run span: rshell.run.unallowed_count — commands blocked by AllowedCommands rshell.run.unknown_count — commands missing from the builtin registry Rationale: rshell does not halt on a blocked or unknown command — it prints an error, sets the call's exit to 127, and keeps executing subsequent statements. The earlier "unallowed" / "unknown" outcome, driven by the last-dispatched command's status, missed rejections that happened mid-script when the last command ran cleanly, and could also mask a clean script whose only blip was an intentional policy block. Counters capture the run-wide totals: operators alert on unallowed_count > 0 for policy violations, or unknown_count > 0 for roadmap gaps, without false positives from non-zero exits. Outcome is back to the "shell-ran-fine" taxonomy (success / timeout / canceled / output_limit / fatal) — any script completion is success. Counts are accumulated per Runner and summed into the parent when a pipeline or (…) subshell adopts a child's exit, so pipeline stages on either side of the pipe are both reflected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Splits the per-run command counters into a richer, more accurately scoped set, namespaced under rshell.run.commands.* so the related tags group visually in the Datadog UI: rshell.run.commands.total every call() invocation, regardless of outcome rshell.run.commands.dispatched commands that ran through a builtin fn rshell.run.commands.unallowed commands blocked by AllowedCommands rshell.run.commands.unknown commands not in the builtin registry unallowed and unknown are now counted independently. Previously a command blocked by policy short-circuited before the registry check, so a name that was both blocked and missing (e.g. `pwd` under a PAR allowlist that doesn't include it) only counted as unallowed — hiding the roadmap signal. The new logic mirrors the per-command rshell.command.is_allowed and rshell.command.is_known tags: both are independent facts about the command name, so both counters increment for a command that matches both predicates. The total counter captures every attempted dispatch (= the number of rshell.command spans produced under the run), giving operators an answer to "how many commands did this script attempt?" that is stable regardless of how the outcomes break down. All four counters are summed up from pipeline stages and (…) subshells so the run span carries run-wide totals. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

astuyve

FYI including the telemetry package here uses the agent telemetry instead of http client, which is initialized in telemetry_on.go only when the build tag is applied and in a local/test build context.

## What does this PR do? Adds tracing in the Private Action Runner so each task dispatch emits a span on `service: private-action-runner`. The rshell bundle swaps the context service to `rshell` before invoking the interpreter, so rshell-emitted child spans land on `service: rshell`. ### Span layout ``` action.run (service: private-action-runner) resource: <action FQN> tag: task_id: <uuid> └── <rshell spans from DataDog/rshell#186> (service: rshell) ``` ### Changes - `WorkflowRunner.RunTask` wraps every dispatch in an `action.run` span. Resource = action FQN, `task_id` as a tag. - `rshell.runCommand` sets `service: rshell` on the context before invoking `interp.Runner.Run` so rshell's public callbacks (OnCommandDispatch, OnPipelineStart, …) attribute child spans correctly. - PAR component bootstraps a `fleettelemetry.Telemetry` flusher at startup using `api_key` + configured `site` so spans ship to the instrumentation telemetry intake. - New `pkg/privateactionrunner/observability/traces.go` holds the service and operation name constants. ## Motivation Give PAR and rshell runtime visibility in APM. Enables debugging blocked commands, rshell script performance, and cross-service latency breakdown without log-mining. ## Depends on - Tracer base changes: base of this PR (`jules.macret/Q/lightweight-tracer-improvements`) - rshell callbacks: DataDog/rshell#186 ## Describe how you validated your changes - Unit tests pass (`dda inv test --targets=./pkg/privateactionrunner/...`) - Deployed to a local Kind cluster; verified end-to-end in org2 (US1 prod) APM — `action.run` span with `service: private-action-runner` and rshell child spans with `service: rshell` ## Additional notes Blocked by the two dependencies above. Merge base first, then this PR. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

julesmcrt marked this pull request as ready for review April 16, 2026 17:29

julesmcrt requested review from AlexandreYang, astuyve, matt-dz, thieman and val06 as code owners April 16, 2026 17:29

astuyve reviewed Apr 16, 2026

View reviewed changes

Comment thread cmd/rshell/main.go Outdated

AlexandreYang reviewed Apr 16, 2026

View reviewed changes

Comment thread go.mod

AlexandreYang reviewed Apr 16, 2026

View reviewed changes

Comment thread interp/call_telemetry_test.go Outdated

matt-dz mentioned this pull request Apr 16, 2026

interp: add CommandObserver option for per-command observation #187

Closed

julesmcrt mentioned this pull request Apr 17, 2026

[PAR] trace private action executions DataDog/datadog-agent#49526

Merged

julesmcrt changed the title ~~feat(telemetry): add builtin usage tracing~~ feat(interp): expose telemetry callback hooks on Runner Apr 17, 2026

julesmcrt force-pushed the jules.macret/Q/builtin-usage-telemetry branch from 9312ae5 to 002ecf8 Compare April 17, 2026 22:33

julesmcrt changed the title ~~feat(interp): expose telemetry callback hooks on Runner~~ feat(telemetry): integrate Datadog Installer lightweight tracer Apr 17, 2026

julesmcrt marked this pull request as draft April 17, 2026 22:34

julesmcrt marked this pull request as ready for review April 17, 2026 22:39

julesmcrt and others added 3 commits April 18, 2026 01:51

julesmcrt mentioned this pull request Apr 20, 2026

bump github.com/DataDog/rshell to v0.0.13 DataDog/datadog-agent#49476

Closed

astuyve approved these changes Apr 20, 2026

View reviewed changes

astuyve added the verified/analysis Human-reviewed static analysis changes label Apr 20, 2026

Bump 3rd party deps

bfa0e73

julesmcrt added this pull request to the merge queue Apr 20, 2026

Merged via the queue into main with commit 22c7c68 Apr 20, 2026
34 checks passed

julesmcrt deleted the jules.macret/Q/builtin-usage-telemetry branch April 20, 2026 16:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(telemetry): integrate Datadog Installer lightweight tracer#186

feat(telemetry): integrate Datadog Installer lightweight tracer#186
julesmcrt merged 5 commits intomainfrom
jules.macret/Q/builtin-usage-telemetry

julesmcrt commented Apr 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

astuyve left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

julesmcrt commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Two telemetry paths

Why no arg sanitization on `rshell.command.name`

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

astuyve left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

julesmcrt commented Apr 16, 2026 •

edited

Loading