Skip to content

feat(telemetry): integrate Datadog Installer lightweight tracer#186

Merged
julesmcrt merged 5 commits intomainfrom
jules.macret/Q/builtin-usage-telemetry
Apr 20, 2026
Merged

feat(telemetry): integrate Datadog Installer lightweight tracer#186
julesmcrt merged 5 commits intomainfrom
jules.macret/Q/builtin-usage-telemetry

Conversation

@julesmcrt
Copy link
Copy Markdown
Collaborator

@julesmcrt julesmcrt commented Apr 16, 2026

See example trace (from a local run)

Summary

  • Emit spans for rshell execution via the Datadog Installer's lightweight tracer (pkg/fleet/installer/telemetry), so embedders like PAR can observe restricted-shell runs without wiring a full tracer themselves.
  • Span layout (operation → resource):
    • runrun: Runner.Run envelope, tagged with rshell.version
    • command → : each dispatched command, with name, argc, exit_code, is_allowed, is_known, has_stdin_pipe, has_output_redirect tags
    • control_flowif | for | for.iteration | pipeline: flattened grouping spans with structural tags (branch_count/branch_taken, iteration_count/broke_early/variable_name, stage_count/exit_code)

Two telemetry paths

  • Embedded in datadog-agent (PAR — the path customers hit): rshell itself never constructs a `*Telemetry` and never opens an HTTP connection. PAR constructs the sender once at startup with its own proxy-aware `*http.Client`, and rshell's spans register on the installer tracer's package-global span registry — they flush through PAR's sender. No HTTP wiring ships in the rshell library.
  • Standalone rshell binary (local use only): ships with telemetry off by default. Build with `-tags with_telemetry` and set `DD_API_KEY` (+ optional `DD_SITE`) to flush to intake. In that build, `cmd/rshell/telemetry_on.go` constructs a `*Telemetry` using `http.DefaultClient` (so `HTTP_PROXY` / `HTTPS_PROXY` are honored) and flushes on exit. This path is strictly for running rshell directly from a terminal — it is never used when rshell is invoked through the datadog-agent.

Why no arg sanitization on `rshell.command.name`

By the time `call()` runs, the shell parser (`mvdan.cc/sh/v3`) has already lexed, expanded, and quote-stripped the input; `args[0]` is a valid command word — a builtin name, a path like `./script.sh`, or a variable-expanded result in command position. Realistically it's never a secret (a customer executing a password as a command is not a failure mode worth engineering around). Metric cardinality is the only residual concern and is bounded in practice.

Test plan

  • `go test ./...`
  • `RSHELL_BASH_TEST=1 go test ./tests/ -run TestShellScenariosAgainstBash` (byte-for-byte bash parity, including 5 new scenarios stressing the iterative `execIfChain` walk)
  • Local end-to-end (standalone path only — not exercised via PAR): `go build -tags with_telemetry -o /tmp/rshell ./cmd/rshell` → `DD_API_KEY="XXX" /tmp/rshell --allow-all-commands -c 'for x in apple banana cherry; do if [ "$x" = banana ]; then echo hi | cat | grep hi; elif [ "$x" = apple ]; then echo elif-branch; true; else echo else-branch; false; fi; done; false || true'` → traces visible under `service:rshell`

🤖 Generated with Claude Code

@julesmcrt julesmcrt marked this pull request as ready for review April 16, 2026 17:29
Comment thread cmd/rshell/main.go Outdated
Comment thread go.mod
Comment thread interp/call_telemetry_test.go Outdated
@julesmcrt julesmcrt changed the title feat(telemetry): add builtin usage tracing feat(interp): expose telemetry callback hooks on Runner Apr 17, 2026
Wraps Runner.Run and command dispatch in the rshell interpreter with
spans emitted via the Datadog Installer's lightweight tracer
(github.com/DataDog/datadog-agent/pkg/fleet/installer/telemetry), so
embedders like PAR can observe restricted-shell execution without
wiring a full tracer.

Span layout (operation → resource):
- run → run: the Runner.Run invocation envelope, tagged with rshell.version
- command → <command name>: each dispatched command, tagged with name,
  argc, exit_code, is_allowed, is_known, has_stdin_pipe, has_output_redirect
- control_flow → if | for | for.iteration | pipeline: flattened grouping
  spans for the corresponding language constructs, with structural tags
  (branch_count/branch_taken, iteration_count/broke_early/variable_name,
  stage_count/exit_code)

The if/elif/else handling was refactored from recursion-through-cmd() into
an iterative execIfChain walk so one span covers the whole chain. Pipelines
use an inPipeline flag on runnerState to suppress nested pipeline spans
from the recursive N-stage pipe construction.

Standalone binary wiring is gated behind the with_telemetry build tag
(cmd/rshell/telemetry_{on,off}.go). In that build, DD_API_KEY and DD_SITE
drive a Telemetry sender constructed with http.DefaultClient (proxy-aware).
The default release binary ships the no-op variant — spans register on the
global tracer but never flush, which is a bounded leak for a short-lived
rshell invocation.

Coverage: Go tests for the span payload (tags, nesting, parent-child
linkage) plus five new byte-for-byte bash-compared scenarios stressing the
iterative execIfChain walk (multi-command condition, side-effect
short-circuit, 8-way elif chain, assignment in condition, exit mid-chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@julesmcrt julesmcrt force-pushed the jules.macret/Q/builtin-usage-telemetry branch from 9312ae5 to 002ecf8 Compare April 17, 2026 22:33
@julesmcrt julesmcrt changed the title feat(interp): expose telemetry callback hooks on Runner feat(telemetry): integrate Datadog Installer lightweight tracer Apr 17, 2026
@julesmcrt julesmcrt marked this pull request as draft April 17, 2026 22:34
@julesmcrt julesmcrt marked this pull request as ready for review April 17, 2026 22:39
julesmcrt and others added 3 commits April 18, 2026 01:51
Adds two structured tags to the run span so operators can slice rshell
invocations without parsing error.message strings:

  rshell.run.exit_code (int) — the interpreter's final exit code.
  rshell.run.outcome   (string enum) — how the run ended:
    success      any script completion (zero or non-zero exit, incl. explicit exit N)
    unallowed    the last command dispatched was blocked by AllowedCommands
    unknown      the last command dispatched was not in the builtin registry
    timeout      context.DeadlineExceeded (rshell --timeout or caller)
    canceled     context.Canceled (caller aborted the run)
    output_limit stdout cap (ErrOutputLimitExceeded) was hit
    fatal        anything else (pipe creation error, internal panic, …)

Only "abnormal" outcomes (timeout/canceled/output_limit/fatal) propagate
retErr into span.Finish, so APM error-rate dashboards stay clean —
plain non-zero exits and blocked commands no longer count as errors.
Operators who care about policy rejections alert on
rshell.run.outcome:unallowed instead.

"unallowed" / "unknown" are resolved via a new lastDispatchStatus field
on runnerState, updated inside call() and propagated up from subshells
when a pipeline or (…) adopts a subshell's exit. Last-dispatch semantics
mean a blocked command early in a script does not pollute outcome if a
later command runs cleanly — matching the conventional shell reading of
the final command's status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "unallowed" / "unknown" outcome categories with two
per-run counters on the run span:

  rshell.run.unallowed_count — commands blocked by AllowedCommands
  rshell.run.unknown_count   — commands missing from the builtin registry

Rationale: rshell does not halt on a blocked or unknown command — it
prints an error, sets the call's exit to 127, and keeps executing
subsequent statements. The earlier "unallowed" / "unknown" outcome,
driven by the last-dispatched command's status, missed rejections that
happened mid-script when the last command ran cleanly, and could also
mask a clean script whose only blip was an intentional policy block.

Counters capture the run-wide totals: operators alert on
unallowed_count > 0 for policy violations, or unknown_count > 0 for
roadmap gaps, without false positives from non-zero exits. Outcome is
back to the "shell-ran-fine" taxonomy (success / timeout / canceled /
output_limit / fatal) — any script completion is success.

Counts are accumulated per Runner and summed into the parent when a
pipeline or (…) subshell adopts a child's exit, so pipeline stages on
either side of the pipe are both reflected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits the per-run command counters into a richer, more accurately
scoped set, namespaced under rshell.run.commands.* so the related
tags group visually in the Datadog UI:

  rshell.run.commands.total       every call() invocation, regardless of outcome
  rshell.run.commands.dispatched  commands that ran through a builtin fn
  rshell.run.commands.unallowed   commands blocked by AllowedCommands
  rshell.run.commands.unknown     commands not in the builtin registry

unallowed and unknown are now counted independently. Previously a
command blocked by policy short-circuited before the registry check,
so a name that was both blocked and missing (e.g. `pwd` under a PAR
allowlist that doesn't include it) only counted as unallowed — hiding
the roadmap signal. The new logic mirrors the per-command
rshell.command.is_allowed and rshell.command.is_known tags: both are
independent facts about the command name, so both counters increment
for a command that matches both predicates.

The total counter captures every attempted dispatch (= the number of
rshell.command spans produced under the run), giving operators an
answer to "how many commands did this script attempt?" that is
stable regardless of how the outcomes break down.

All four counters are summed up from pipeline stages and (…)
subshells so the run span carries run-wide totals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@astuyve astuyve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI including the telemetry package here uses the agent telemetry instead of http client, which is initialized in telemetry_on.go only when the build tag is applied and in a local/test build context.

@astuyve astuyve added the verified/analysis Human-reviewed static analysis changes label Apr 20, 2026
@julesmcrt julesmcrt added this pull request to the merge queue Apr 20, 2026
Merged via the queue into main with commit 22c7c68 Apr 20, 2026
34 checks passed
@julesmcrt julesmcrt deleted the jules.macret/Q/builtin-usage-telemetry branch April 20, 2026 16:16
julesmcrt added a commit to DataDog/datadog-agent that referenced this pull request Apr 20, 2026
## What does this PR do?

Adds tracing in the Private Action Runner so each task dispatch emits a
span on `service: private-action-runner`. The rshell bundle swaps the
context service to `rshell` before invoking the interpreter, so
rshell-emitted child spans land on `service: rshell`.

### Span layout

```
action.run                          (service: private-action-runner)
  resource: <action FQN>
  tag:      task_id: <uuid>
  └── <rshell spans from DataDog/rshell#186>   (service: rshell)
```

### Changes

- `WorkflowRunner.RunTask` wraps every dispatch in an `action.run` span.
Resource = action FQN, `task_id` as a tag.
- `rshell.runCommand` sets `service: rshell` on the context before
invoking `interp.Runner.Run` so rshell's public callbacks
(OnCommandDispatch, OnPipelineStart, …) attribute child spans correctly.
- PAR component bootstraps a `fleettelemetry.Telemetry` flusher at
startup using `api_key` + configured `site` so spans ship to the
instrumentation telemetry intake.
- New `pkg/privateactionrunner/observability/traces.go` holds the
service and operation name constants.

## Motivation

Give PAR and rshell runtime visibility in APM. Enables debugging blocked
commands, rshell script performance, and cross-service latency breakdown
without log-mining.

## Depends on

- Tracer base changes: base of this PR
(`jules.macret/Q/lightweight-tracer-improvements`)
- rshell callbacks: DataDog/rshell#186

## Describe how you validated your changes

- Unit tests pass (`dda inv test
--targets=./pkg/privateactionrunner/...`)
- Deployed to a local Kind cluster; verified end-to-end in org2 (US1
prod) APM — `action.run` span with `service: private-action-runner` and
rshell child spans with `service: rshell`

## Additional notes

Blocked by the two dependencies above. Merge base first, then this PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

verified/analysis Human-reviewed static analysis changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants