OTel Instrumentation Improvement: emit gh-aw.agent.agent sub-span for cancelled runs
Analysis Date: 2026-04-23
Priority: Medium
Effort: Small (< 2h)
Problem
When a workflow run is manually cancelled while the agent is executing, the gh-aw.agent.agent sub-span is not emitted, even though a meaningful duration of agent execution occurred before cancellation. An on-call engineer looking at a cancelled trace sees only the conclusion span with gh-aw.agent.conclusion = "cancelled", with no way to answer: "How long did the agent run before we cancelled it?"
The root cause is in send_otlp_span.cjs at the statSync fallback (lines 855-863). When agent_output.json does not exist (the agent was killed before writing output), the fallback agentEndMs = nowMs() is applied only when isAgentFailure is true (conclusion "failure" or "timed_out"). The isAgentCancelled case is not included, so agentEndMs stays null and the sub-span is suppressed.
Why This Matters (DevOps Perspective)
Cancelled runs are precisely the case where execution duration matters most operationally: if an engineer cancelled a job "because it seemed stuck", they need to know how long it had actually been running to judge whether the cancellation was warranted or premature. Without the agent sub-span, the trace shows only a conclusion span with status.code=2 and status.message="agent cancelled": no agent execution duration and no timeline anchoring. This extends MTTR for investigations into questions such as: "Was the cancellation early? Did the agent make progress? How close was it to finishing?"
The timed_out case (which already emits the sub-span via the isAgentFailure path) follows exactly the same pattern: the agent is killed before writing agent_output.json, and the sub-span is still useful to show how long it ran.
Current Behavior
// actions/setup/js/send_otlp_span.cjs (lines 855-869)
let agentEndMs = null;
try {
  agentEndMs = fs.statSync("/tmp/gh-aw/agent_output.json").mtimeMs;
} catch {
  // agent_output.json may be absent for agent failures, including timed-out
  // runs where the process was killed before writing output. Fall back to
  // nowMs() so we still emit the dedicated agent span for these failures.
  if (isAgentFailure && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
    // ^^^ isAgentCancelled is NOT included here, so cancelled runs get no sub-span
  }
}
// Condition to emit agent sub-span: fails when agentEndMs is null (cancelled case)
if (jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0 && typeof agentEndMs === "number" && agentEndMs > agentStartMs) {
  // ... agent sub-span emission (elided in this excerpt)
}
Where:
const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
const isAgentCancelled = agentConclusion === "cancelled"; // not used in the fallback
Proposed Change
One-line fix: extend the isAgentFailure guard to also cover isAgentCancelled.
// actions/setup/js/send_otlp_span.cjs: catch block inside sendJobConclusionSpan
} catch {
  // agent_output.json may be absent for agent failures (including timed-out and
  // cancelled runs) where the process was killed before writing output. Fall back
  // to nowMs() so we still emit the dedicated agent span for these outcomes.
  if ((isAgentFailure || isAgentCancelled) && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
  }
}
No other code paths need to change: statusCode, statusMessage, and the conclusion span attributes already handle the cancelled case correctly (lines 714-724).
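To make the effect of the extended guard concrete, here is a minimal, self-contained sketch of the fallback logic. The function name resolveAgentEndMs and its injected statSync/nowMs parameters are hypothetical and exist only for illustration; in the actual file the logic is inline in sendJobConclusionSpan and may differ in detail.

```javascript
// Sketch only: resolveAgentEndMs and its injectable dependencies are
// hypothetical names, not part of send_otlp_span.cjs.
const fs = require("node:fs");

function resolveAgentEndMs({ agentConclusion, jobName, agentStartMs, outputPath, statSync = fs.statSync, nowMs = Date.now }) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";
  try {
    // Preferred source: the time the agent last wrote its output file.
    return statSync(outputPath).mtimeMs;
  } catch {
    // Output file is absent. Fall back to "now" for every outcome where the
    // agent was killed mid-run, which with the proposed fix includes cancellation.
    const killedMidRun = isAgentFailure || isAgentCancelled;
    if (killedMidRun && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
      return nowMs();
    }
    return null; // no usable end time, so no sub-span is emitted
  }
}

// Usage: simulate a missing agent_output.json with a throwing statSync.
const missingFile = () => { throw new Error("ENOENT"); };
const base = { jobName: "agent", agentStartMs: 1000, outputPath: "/tmp/gh-aw/agent_output.json", statSync: missingFile, nowMs: () => 5000 };
console.log(resolveAgentEndMs({ ...base, agentConclusion: "cancelled" })); // 5000
console.log(resolveAgentEndMs({ ...base, agentConclusion: "success" })); // null
```

Injecting statSync and nowMs keeps the sketch deterministic; the real code reads the file system and clock directly.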
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog: cancelled runs produce a gh-aw.agent.agent sub-span showing how long the agent executed before cancellation, making agent execution duration visible for ALL non-OK outcomes (failure, timed_out, cancelled).
- In the JSONL mirror: the otel.jsonl artifact for cancelled runs will contain two entries (the agent sub-span and the conclusion span) instead of just the conclusion span, enabling artifact-based debugging without a live collector.
- For on-call engineers: "How long did the agent run before we cancelled it?" becomes answerable from the trace view alone, reducing investigation time for deliberate and accidental cancellations.
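As an illustration of artifact-based debugging, the sketch below computes the agent execution duration from an otel.jsonl file. It assumes each line is a standard OTLP/HTTP JSON payload (resourceSpans, then scopeSpans, then spans) and that the sub-span is named gh-aw.agent.agent; the actual layout of the mirror is not shown in this issue and may differ. The helper name agentDurationMs and the sample timestamps are hypothetical.

```javascript
// Sketch only: assumes each otel.jsonl line is an OTLP/HTTP JSON payload.
// agentDurationMs is a hypothetical helper, not part of the repository.
function agentDurationMs(jsonlText, spanName = "gh-aw.agent.agent") {
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const payload = JSON.parse(line);
    for (const rs of payload.resourceSpans ?? []) {
      for (const ss of rs.scopeSpans ?? []) {
        for (const span of ss.spans ?? []) {
          if (span.name !== spanName) continue;
          // OTLP JSON encodes nanosecond timestamps as decimal strings,
          // so use BigInt to avoid precision loss.
          const nanos = BigInt(span.endTimeUnixNano) - BigInt(span.startTimeUnixNano);
          return Number(nanos / 1000000n);
        }
      }
    }
  }
  return null; // no agent sub-span found (today's situation for cancelled runs)
}

// Usage with a made-up single-line artifact: a 90-second agent run.
const sampleLine = JSON.stringify({
  resourceSpans: [{ scopeSpans: [{ spans: [
    { name: "gh-aw.agent.agent", startTimeUnixNano: "1776902400000000000", endTimeUnixNano: "1776902490000000000" },
  ] }] }],
});
console.log(agentDurationMs(sampleLine)); // 90000
```

In practice the text would come from something like fs.readFileSync on the downloaded artifact.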
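Implementation Steps
- In actions/setup/js/send_otlp_span.cjs, change the if (isAgentFailure && ...) guard in the catch block to if ((isAgentFailure || isAgentCancelled) && ...) (one-line change).
- Add a test to send_otlp_span.test.cjs mirroring the existing "emits a dedicated agent span on timed_out when agent_output mtime is unavailable" test (line 1677) but for GH_AW_AGENT_CONCLUSION = "cancelled": assert mockFetch was called twice and that the first call contains gh-aw.agent.agent.
- Run cd actions/setup/js && npx vitest run to confirm tests pass.
- Run make fmt to ensure formatting.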
Evidence from Live Sentry Data
The Sentry MCP server was unavailable during this analysis run (bridge returned no tools). The gap is confirmed purely from static analysis of the source code:
- isAgentCancelled is defined and used for statusCode/statusMessage on the conclusion span (lines 715, 722-724), so the infrastructure for detecting cancellation is already present.
- The catch block comment on lines 859-861 explicitly mentions "timed-out runs" but not cancelled runs, which suggests the omission is unintentional.
- The existing test "marks cancelled conclusion spans as errors" (line 2262) verifies span.status.code === 2 but does not assert that a second fetch call (the agent sub-span) occurred, which is consistent with the sub-span being absent.
- The parallel timed_out test (line 1677) asserts expect(mockFetch).toHaveBeenCalledTimes(2); the expected pattern for cancelled runs is identical.
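The assertion missing from the cancelled-run test could be built on a helper like the one sketched below, which extracts span names from recorded fetch calls. It assumes each call has the shape fetch(url, init) with init.body holding an OTLP/HTTP JSON string; the helper name spanNamesPerCall, the collector URL, and the placeholder conclusion span name are hypothetical. In the real vitest suite the input would be mockFetch.mock.calls.

```javascript
// Sketch only: spanNamesPerCall is a hypothetical helper. It assumes
// init.body is an OTLP/HTTP JSON string, as sent to a traces endpoint.
function spanNamesPerCall(calls) {
  return calls.map(([, init]) =>
    (JSON.parse(init.body).resourceSpans ?? [])
      .flatMap(rs => rs.scopeSpans ?? [])
      .flatMap(ss => ss.spans ?? [])
      .map(span => span.name)
  );
}

// Stand-in for mockFetch.mock.calls after a cancelled run with the fix applied.
// The URL and the second span name are placeholders, not real values.
const otlpBody = name => JSON.stringify({ resourceSpans: [{ scopeSpans: [{ spans: [{ name }] }] }] });
const recordedCalls = [
  ["https://collector.invalid/v1/traces", { body: otlpBody("gh-aw.agent.agent") }],
  ["https://collector.invalid/v1/traces", { body: otlpBody("conclusion-span-placeholder") }],
];

const namesPerCall = spanNamesPerCall(recordedCalls);
console.log(namesPerCall.length); // 2
console.log(namesPerCall[0].includes("gh-aw.agent.agent")); // true
```

A test would then assert that two calls were recorded and that the first one carries the agent sub-span, matching the pattern of the timed_out test.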
Related Files
- actions/setup/js/send_otlp_span.cjs: single-line fix in sendJobConclusionSpan
- actions/setup/js/send_otlp_span.test.cjs: new test case (mirror of line 1677 block)
Generated by the Daily OTel Instrumentation Advisor workflow