[otel-advisor] OTel improvement: emit agent execution sub-span for cancelled workflow runs #28168

@github-actions

📡 OTel Instrumentation Improvement: emit gh-aw.agent.agent sub-span for cancelled runs

Analysis Date: 2026-04-23
Priority: Medium
Effort: Small (< 2h)

Problem

When a workflow run is manually cancelled while the agent is executing, the gh-aw.agent.agent sub-span is not emitted, even though a meaningful duration of agent execution occurred before cancellation. An on-call engineer looking at a cancelled trace sees only the conclusion span with gh-aw.agent.conclusion = "cancelled", with no way to answer: "How long did the agent run before we cancelled it?"

The root cause is in send_otlp_span.cjs at the statSync fallback (lines 855–863). When agent_output.json does not exist (the agent was killed before writing output), the fallback agentEndMs = nowMs() is only set for isAgentFailure ("failure" or "timed_out"). The isAgentCancelled case is not included, leaving agentEndMs = null and suppressing the sub-span.

Why This Matters (DevOps Perspective)

Cancelled runs are precisely the case where execution duration is most operationally important: if an engineer cancelled a job "because it seemed stuck", they need to know how long it had actually been running to determine whether the cancellation was warranted or premature. Without the agent sub-span, the trace shows only a status.code=2, status.message="agent cancelled" conclusion span, with no duration of agent execution and no timeline anchoring. This extends MTTR for investigations into: "Was the cancellation early? Did the agent make progress? How close was it to finishing?"

The timed_out case (which already emits the sub-span via the isAgentFailure path) follows exactly the same pattern: the agent is killed before writing agent_output.json, and the sub-span is still useful for showing how long it ran.

Current Behavior
// actions/setup/js/send_otlp_span.cjs (lines 855–869)
let agentEndMs = null;
try {
  agentEndMs = fs.statSync("/tmp/gh-aw/agent_output.json").mtimeMs;
} catch {
  // agent_output.json may be absent for agent failures, including timed-out
  // runs where the process was killed before writing output. Fall back to
  // nowMs() so we still emit the dedicated agent span for these failures.
  if (isAgentFailure && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
    //  ^^^ isAgentCancelled is NOT included here, so cancelled runs get no sub-span
  }
}

// Condition to emit agent sub-span; fails when agentEndMs is null (cancelled case)
if (jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0 && typeof agentEndMs === "number" && agentEndMs > agentStartMs) {

Where:

const isAgentFailure  = agentConclusion === "failure" || agentConclusion === "timed_out";
const isAgentCancelled = agentConclusion === "cancelled";  // not used in the fallback
Proposed Change

One-line fix: extend the isAgentFailure guard to also cover isAgentCancelled.

// actions/setup/js/send_otlp_span.cjs (catch block inside sendJobConclusionSpan)
} catch {
  // agent_output.json may be absent for agent failures (including timed-out and
  // cancelled runs) where the process was killed before writing output. Fall back
  // to nowMs() so we still emit the dedicated agent span for these outcomes.
  if ((isAgentFailure || isAgentCancelled) && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
  }
}

No other code paths need to change: statusCode, statusMessage, and the conclusion span attributes already handle the cancelled case correctly (lines 714–724).
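
To make the fallback rule easier to reason about in isolation, here is a minimal sketch that restates it as a pure function. The name resolveAgentEndMs and its injectable statSync and nowMs parameters are hypothetical and do not exist in send_otlp_span.cjs; only the guard condition itself is taken from the snippets above.

```javascript
const fs = require("node:fs");

// Hypothetical helper (not part of send_otlp_span.cjs): the same try/catch
// rule as above, with the file stat and the clock passed in as parameters.
function resolveAgentEndMs({ jobName, agentConclusion, agentStartMs, outputPath, nowMs = Date.now, statSync = fs.statSync }) {
  try {
    // Preferred source: the time the agent last wrote its output file.
    return statSync(outputPath).mtimeMs;
  } catch {
    const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
    const isAgentCancelled = agentConclusion === "cancelled";
    const hasStart = typeof agentStartMs === "number" && agentStartMs > 0;
    // Output file is missing: fall back to the current time for every
    // outcome where the agent was killed before it could write output.
    if ((isAgentFailure || isAgentCancelled) && jobName === "agent" && hasStart) {
      return nowMs();
    }
    return null; // no usable end time, so the sub-span stays suppressed
  }
}

// Simulate a missing agent_output.json without touching the filesystem.
const missingFile = () => {
  throw new Error("ENOENT");
};

const endMs = resolveAgentEndMs({
  jobName: "agent",
  agentConclusion: "cancelled",
  agentStartMs: 1000,
  outputPath: "/tmp/gh-aw/agent_output.json",
  nowMs: () => 5000,
  statSync: missingFile,
});
console.log(endMs); // 5000
```

With the current guard the same call would yield null, so the later agentEndMs > agentStartMs check could never pass for a cancelled run.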

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: cancelled runs produce a gh-aw.agent.agent sub-span showing how long the agent executed before cancellation, making agent execution duration visible for ALL non-OK outcomes (failure, timed_out, cancelled).
  • In the JSONL mirror: the otel.jsonl artifact for cancelled runs will contain two entries (the agent sub-span and the conclusion span) instead of just the conclusion span, enabling artifact-based debugging without a live collector.
  • For on-call engineers: "How long did the agent run before we cancelled it?" becomes answerable from the trace view alone, reducing investigation time for deliberate and accidental cancellations.
Implementation Steps
  • In actions/setup/js/send_otlp_span.cjs, change the if (isAgentFailure && ...) guard in the catch block to if ((isAgentFailure || isAgentCancelled) && ...) (one-line change)
  • Add a test case to send_otlp_span.test.cjs mirroring the existing "emits a dedicated agent span on timed_out when agent_output mtime is unavailable" test (line 1677) but for GH_AW_AGENT_CONCLUSION = "cancelled"; assert that mockFetch was called twice and that the first call contains gh-aw.agent.agent
  • Run cd actions/setup/js && npx vitest run to confirm tests pass
  • Run make fmt to ensure formatting
  • Open a PR referencing this issue
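
The assertions for the new test case can be sketched without the real module. Everything below is a hypothetical stand-in: emitSpans, record, the coverCancelled flag, and the placeholder conclusion payload are illustrative only, and the real test would use vitest's expect and the existing mockFetch setup rather than node:assert. The sketch only shows the shape of the check: one request before the fix, two after, with the agent sub-span sent first.

```javascript
const { strictEqual, ok } = require("node:assert/strict");

// Hypothetical stand-in for the sender when agent_output.json is missing.
// `post` receives one serialized span per call, like one fetch per span.
function emitSpans(post, { agentConclusion, agentStartMs, nowMs, coverCancelled }) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";
  // coverCancelled switches between the current and the proposed guard.
  const fallbackApplies = isAgentFailure || (coverCancelled && isAgentCancelled);
  const agentEndMs = fallbackApplies ? nowMs : null;

  if (typeof agentEndMs === "number" && agentEndMs > agentStartMs) {
    post(JSON.stringify({ name: "gh-aw.agent.agent", startMs: agentStartMs, endMs: agentEndMs }));
  }
  // Placeholder payload: the real conclusion span carries more attributes.
  post(JSON.stringify({ name: "conclusion (placeholder)", "gh-aw.agent.conclusion": agentConclusion }));
}

// Run one scenario and return the recorded request bodies in call order.
function record(options) {
  const bodies = [];
  emitSpans((body) => bodies.push(body), options);
  return bodies;
}

const scenario = { agentConclusion: "cancelled", agentStartMs: 1000, nowMs: 5000 };

// Current guard: only the conclusion span is sent.
const before = record({ ...scenario, coverCancelled: false });
strictEqual(before.length, 1);

// Proposed guard: agent sub-span first, then the conclusion span.
const after = record({ ...scenario, coverCancelled: true });
strictEqual(after.length, 2);
ok(after[0].includes("gh-aw.agent.agent"));
```

Checking the call count under both guards is a cheap way to confirm that the new test fails before the one-line change and passes after it.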
Evidence from Live Sentry Data

The Sentry MCP server was unavailable during this analysis run (bridge returned no tools). The gap is confirmed purely from static analysis of the source code:

  • isAgentCancelled is defined and used for statusCode/statusMessage on the conclusion span (lines 715, 722–724), so the infrastructure for detecting cancellation is already present.
  • The catch block comment on lines 859–861 explicitly mentions "timed-out runs" but not cancelled runs, which suggests the omission is unintentional.
  • The existing test (line 2262) for "marks cancelled conclusion spans as errors" verifies span.status.code === 2 but does not assert that a second fetch call (the agent sub-span) occurred, which is consistent with the sub-span being absent.
  • The parallel timed_out test (line 1677) asserts expect(mockFetch).toHaveBeenCalledTimes(2); the expected pattern for cancelled runs is identical.
Related Files
  • actions/setup/js/send_otlp_span.cjs: single-line fix in sendJobConclusionSpan
  • actions/setup/js/send_otlp_span.test.cjs: new test case (mirror of line 1677 block)

Generated by the Daily OTel Instrumentation Advisor workflow

  • expires on Apr 30, 2026, 9:31 PM UTC
