[otel-advisor] OTel improvement: emit agent execution sub-span for cancelled workflow runs #28168

### 📡 OTel Instrumentation Improvement: emit `gh-aw.agent.agent` sub-span for cancelled runs

**Analysis Date**: 2026-04-23
**Priority**: Medium
**Effort**: Small (< 2h)

### Problem

When a workflow run is manually cancelled while the agent is executing, the `gh-aw.agent.agent` sub-span is **not emitted** — even though a meaningful duration of agent execution occurred before cancellation. An on-call engineer looking at a cancelled trace sees only the conclusion span with `gh-aw.agent.conclusion = "cancelled"`, with no way to answer: *"How long did the agent run before we cancelled it?"*

The root cause is in `send_otlp_span.cjs` at the `statSync` fallback (lines 855–863). When `agent_output.json` does not exist (the agent was killed before writing output), the fallback `agentEndMs = nowMs()` runs only when `isAgentFailure` is true (conclusion `"failure"` or `"timed_out"`). The `isAgentCancelled` case is not covered, so `agentEndMs` stays `null`, the emission condition fails, and the sub-span is suppressed.

<details>
<summary>Why This Matters (DevOps Perspective)</summary>

Cancelled runs are precisely the case where execution duration is most operationally important: if an engineer cancelled a job "because it seemed stuck", they need to know how long it had actually been running to determine whether the cancellation was warranted or premature. Without the agent sub-span, the trace shows only a `status.code=2, status.message="agent cancelled"` conclusion span — no duration of agent execution, no timeline anchoring. This extends MTTR for investigations into: "Was the cancellation early? Did the agent make progress? How close was it to finishing?"

The `timed_out` case (which already emits the sub-span via the `isAgentFailure` path) is the exact same pattern — the agent is killed before writing `agent_output.json`, and the sub-span is still useful to show how long it ran.

</details>

<details>
<summary>Current Behavior</summary>

```javascript
// actions/setup/js/send_otlp_span.cjs (lines 855–869)
let agentEndMs = null;
try {
  agentEndMs = fs.statSync("/tmp/gh-aw/agent_output.json").mtimeMs;
} catch {
  // agent_output.json may be absent for agent failures, including timed-out
  // runs where the process was killed before writing output. Fall back to
  // nowMs() so we still emit the dedicated agent span for these failures.
  if (isAgentFailure && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
    // ^^^ isAgentCancelled is NOT included in the guard above, so cancelled runs get no sub-span
  }
}

// Condition to emit agent sub-span; fails when agentEndMs is null (cancelled case)
if (jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0 && typeof agentEndMs === "number" && agentEndMs > agentStartMs) {
  // ... sub-span emission (elided in this excerpt) ...
}
```

Where:
```javascript
const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
const isAgentCancelled = agentConclusion === "cancelled"; // not used in the fallback
```

</details>

<details>
<summary>Proposed Change</summary>

One-line fix: extend the `isAgentFailure` guard to also cover `isAgentCancelled`.

```javascript
// actions/setup/js/send_otlp_span.cjs: catch block inside sendJobConclusionSpan
} catch {
  // agent_output.json may be absent for agent failures (including timed-out and
  // cancelled runs) where the process was killed before writing output. Fall back
  // to nowMs() so we still emit the dedicated agent span for these outcomes.
  if ((isAgentFailure || isAgentCancelled) && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
    agentEndMs = nowMs();
  }
}
```

No other code paths need to change: `statusCode`, `statusMessage`, and the conclusion span attributes already handle the cancelled case correctly (lines 714–724).
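
As a rough illustration of the guard change, the sketch below models the fallback as a pure function. It is not the code in `send_otlp_span.cjs`: `resolveAgentEndMs`, `shouldEmitAgentSpan`, `readMtimeMs`, and the `coverCancelled` flag are hypothetical names introduced here so that the behavior before and after the fix can be compared without touching the filesystem.

```javascript
// Hypothetical model of the fallback, for illustration only.
// readMtimeMs stands in for fs.statSync(path).mtimeMs and throws when the file is absent.
function resolveAgentEndMs({ agentConclusion, jobName, agentStartMs, readMtimeMs, nowMs, coverCancelled }) {
  const isAgentFailure = agentConclusion === "failure" || agentConclusion === "timed_out";
  const isAgentCancelled = agentConclusion === "cancelled";
  try {
    return readMtimeMs();
  } catch {
    // coverCancelled = false models the current guard, true models the proposed one.
    const killedBeforeOutput = isAgentFailure || (coverCancelled && isAgentCancelled);
    if (killedBeforeOutput && jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0) {
      return nowMs();
    }
    return null;
  }
}

// Same shape as the emission condition quoted in "Current Behavior".
function shouldEmitAgentSpan(jobName, agentStartMs, agentEndMs) {
  return jobName === "agent" && typeof agentStartMs === "number" && agentStartMs > 0 && typeof agentEndMs === "number" && agentEndMs > agentStartMs;
}

// Simulate a cancelled run whose agent_output.json was never written.
const missingFile = () => {
  throw new Error("ENOENT");
};
const base = { agentConclusion: "cancelled", jobName: "agent", agentStartMs: 1000, readMtimeMs: missingFile, nowMs: () => 61000 };

const before = resolveAgentEndMs({ ...base, coverCancelled: false });
const after = resolveAgentEndMs({ ...base, coverCancelled: true });
console.log(before, shouldEmitAgentSpan("agent", 1000, before)); // null false
console.log(after, shouldEmitAgentSpan("agent", 1000, after)); // 61000 true
```

With `coverCancelled: false` the cancelled run resolves to `null` and the emission condition fails; with `true` it falls back to the current time, which is what `timed_out` already does today.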

</details>

<details>
<summary>Expected Outcome</summary>

After this change:

- In Grafana / Honeycomb / Datadog: cancelled runs produce a `gh-aw.agent.agent` sub-span showing how long the agent executed before cancellation, making agent execution duration visible for ALL non-OK outcomes (failure, timed_out, cancelled).
- In the JSONL mirror: the `otel.jsonl` artifact for cancelled runs will contain two entries — the agent sub-span and the conclusion span — instead of just the conclusion span, enabling artifact-based debugging without a live collector.
- For on-call engineers: "How long did the agent run before we cancelled it?" becomes answerable from the trace view alone, reducing investigation time for deliberate and accidental cancellations.
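
To make the artifact-based path concrete, here is a minimal sketch that pulls the agent sub-span duration out of a JSONL mirror. It assumes each line of `otel.jsonl` is a standard OTLP/JSON export payload (`resourceSpans` containing `scopeSpans` containing `spans`); the actual mirror layout may differ. The helper name `agentSpanDurationsMs`, the sample timestamps, and the `placeholder.conclusion` span name are invented for the example.

```javascript
// Sketch only: assumes one OTLP/JSON export payload per line of otel.jsonl.
function agentSpanDurationsMs(jsonlText, spanName = "gh-aw.agent.agent") {
  const durations = [];
  for (const line of jsonlText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const payload = JSON.parse(line);
    for (const resourceSpan of payload.resourceSpans ?? []) {
      for (const scopeSpan of resourceSpan.scopeSpans ?? []) {
        for (const span of scopeSpan.spans ?? []) {
          if (span.name !== spanName) continue;
          // OTLP/JSON encodes nanosecond timestamps as decimal strings,
          // so subtract as BigInt before converting to milliseconds.
          const nanos = BigInt(span.endTimeUnixNano) - BigInt(span.startTimeUnixNano);
          durations.push(Number(nanos / 1000000n));
        }
      }
    }
  }
  return durations;
}

// Made-up sample: an agent sub-span lasting 95 s, followed by a placeholder conclusion span.
const wrap = span => JSON.stringify({ resourceSpans: [{ scopeSpans: [{ spans: [span] }] }] });
const sample = [
  wrap({ name: "gh-aw.agent.agent", startTimeUnixNano: "1000000000", endTimeUnixNano: "96000000000" }),
  wrap({ name: "placeholder.conclusion", startTimeUnixNano: "1000000000", endTimeUnixNano: "97000000000" }),
].join("\n");

console.log(agentSpanDurationsMs(sample)); // [ 95000 ]
```

For a cancelled run processed before this change, the same call would return an empty array, because only the conclusion span is written.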

</details>

<details>
<summary>Implementation Steps</summary>

- [ ] In `actions/setup/js/send_otlp_span.cjs`, change the `if (isAgentFailure && ...)` guard in the `catch` block to `if ((isAgentFailure || isAgentCancelled) && ...)` (one-line change)
- [ ] Add a test case to `send_otlp_span.test.cjs` mirroring the existing `"emits a dedicated agent span on timed_out when agent_output mtime is unavailable"` test (line 1677) but for `GH_AW_AGENT_CONCLUSION = "cancelled"` — assert `mockFetch` was called twice and that the first call contains `gh-aw.agent.agent`
- [ ] Run `cd actions/setup/js && npx vitest run` to confirm tests pass
- [ ] Run `make fmt` to ensure formatting
- [ ] Open a PR referencing this issue

</details>

<details>
<summary>Evidence (Static Analysis; Live Sentry Data Unavailable)</summary>

The Sentry MCP server was unavailable during this analysis run (bridge returned no tools). The gap is confirmed purely from static analysis of the source code:

- `isAgentCancelled` is defined and used for `statusCode`/`statusMessage` on the conclusion span (lines 715, 722–724) — so the infrastructure for detecting cancellation is already present.
- The `catch` block comment on lines 859–861 explicitly mentions "timed-out runs" but not cancelled runs, which suggests the omission is unintentional.
- The existing test (line 2262) for "marks cancelled conclusion spans as errors" verifies `span.status.code === 2` but does **not** assert that a second fetch call (the agent sub-span) occurred, so this test does not exercise sub-span emission for cancelled runs.
- The parallel `timed_out` test (line 1677) asserts `expect(mockFetch).toHaveBeenCalledTimes(2)` — the expected pattern for cancelled runs is identical.

</details>

<details>
<summary>Related Files</summary>

- `actions/setup/js/send_otlp_span.cjs` — single-line fix in `sendJobConclusionSpan`
- `actions/setup/js/send_otlp_span.test.cjs` — new test case (mirror of line 1677 block)

</details>

---

*Generated by the [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24859565490) workflow*

> Generated by [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24859565490/agentic_workflow) · ● 215.9K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on Apr 30, 2026, 9:31 PM UTC
