Merged
77 changes: 77 additions & 0 deletions docs/adr/27137-add-ambient-context-metric.md
@@ -0,0 +1,77 @@
# ADR-27137: Track Ambient Context via First LLM Invocation Token Metrics

**Date**: 2026-04-19
**Status**: Draft
**Deciders**: pelikhan, Copilot

---

## Part 1 — Narrative (Human-Friendly)

### Context

The gh-aw tooling collects and aggregates per-run token usage from the firewall proxy's `token-usage.jsonl` log. Aggregate totals (total input, cache-read, and output tokens) are already surfaced in `audit` and `logs` outputs, but they conflate system-prompt overhead with actual task-execution cost. Teams want to compare how "heavy" the ambient context (system prompt, tools list, memory) is across different workflow configurations without writing custom log analysis. The first LLM invocation in a run is a natural proxy for ambient context size because it fires before the agent has accumulated any conversation history, so its input token count primarily reflects the static context loaded at startup.

### Decision

We will introduce an `AmbientContextMetrics` struct that captures the token footprint (`input_tokens`, `cached_tokens`, `effective_tokens`) of the chronologically first LLM invocation in `token-usage.jsonl`, and expose it as an optional `ambient_context` field in both the `audit` and `logs` JSON output schemas. Chronological ordering is determined by the `timestamp` field (RFC 3339 / RFC 3339 Nano); file order is used as a stable tiebreaker for entries that share a timestamp or lack one. The `effective_tokens` value is defined as `input_tokens + cache_read_tokens`, consistent with the existing effective-token convention in the codebase.
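The selection rule above can be sketched in Go. This is an illustrative sketch, not the PR's actual implementation: the `usageEntry` type, the `parseTS` helper, and its field names are assumptions; only `AmbientContextMetrics` and its three fields come from the diff.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// usageEntry is a hypothetical stand-in for one token-usage.jsonl record.
type usageEntry struct {
	Timestamp       string // RFC 3339 / RFC 3339 Nano; may be empty
	InputTokens     int
	CacheReadTokens int
}

// AmbientContextMetrics mirrors the struct added by this PR.
type AmbientContextMetrics struct {
	InputTokens     int
	CachedTokens    int
	EffectiveTokens int
}

// parseTS tries RFC 3339 Nano first, then falls back to RFC 3339.
func parseTS(s string) (time.Time, bool) {
	if t, err := time.Parse(time.RFC3339Nano, s); err == nil {
		return t, true
	}
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		return t, true
	}
	return time.Time{}, false
}

// ambientContext returns metrics for the chronologically first entry,
// using file order as the stable tiebreaker; nil when there are no entries.
func ambientContext(entries []usageEntry) *AmbientContextMetrics {
	if len(entries) == 0 {
		return nil
	}
	sorted := append([]usageEntry(nil), entries...) // don't mutate the caller's slice
	sort.SliceStable(sorted, func(a, b int) bool {
		ta, okA := parseTS(sorted[a].Timestamp)
		tb, okB := parseTS(sorted[b].Timestamp)
		if !okA || !okB {
			return false // missing timestamp: stable sort preserves file order
		}
		return ta.Before(tb)
	})
	first := sorted[0]
	return &AmbientContextMetrics{
		InputTokens:     first.InputTokens,
		CachedTokens:    first.CacheReadTokens,
		EffectiveTokens: first.InputTokens + first.CacheReadTokens,
	}
}

func main() {
	entries := []usageEntry{
		{Timestamp: "2026-04-19T10:00:05Z", InputTokens: 5000},
		{Timestamp: "2026-04-19T10:00:01Z", InputTokens: 1200, CacheReadTokens: 300},
	}
	m := ambientContext(entries)
	fmt.Println(m.InputTokens, m.CachedTokens, m.EffectiveTokens) // 1200 300 1500
}
```

Note the stable sort: returning `false` for any pair where a timestamp fails to parse leaves those entries in file order, which is exactly the tiebreaker behavior the decision specifies.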

### Alternatives Considered

#### Alternative 1: Average or Median Token Count Across All Invocations

Computing an average or median across all invocations was considered as a way to characterize "typical" invocation cost. This was rejected because it mixes task-execution turns — which accumulate conversation history and grow over time — with the initial system-prompt turn, making it a poor proxy for ambient context size. The metric would also vary with run length, complicating cross-run comparisons.

#### Alternative 2: Expose the Full Ordered Invocation List and Let Consumers Filter

Surfacing the complete sorted list of token usage entries in the output and letting downstream tools select the first entry was considered to give consumers maximum flexibility. This was rejected because it would significantly increase output payload size for long-running agents (which may make hundreds of LLM calls) and because the first-invocation semantic is stable and well-understood enough to encode directly in the tool.

#### Alternative 3: Use the Invocation with the Fewest Input Tokens as a Proxy

Using the minimum-input-token invocation as a proxy for ambient context (assuming "lighter" calls reflect smaller context) was considered. This was rejected because minimum-token invocations can occur at any point during a run if the agent routes cheap subtasks to a smaller or cheaper model, making the metric unreliable as an ambient context indicator.

### Consequences

#### Positive
- Teams can compare ambient context overhead across workflow configurations using a single structured field without parsing raw logs.
- The metric is optional (`omitempty`) so existing JSON consumers of `audit` and `logs` output are not broken when it is absent.
- Chronological sorting of token usage entries is now an explicit, tested behavior that can be reused for future metrics that need temporal ordering.

#### Negative
- The first invocation is a heuristic proxy, not a guaranteed measure of system-prompt size. If a workflow fires a lightweight "warm-up" or health-check LLM call before the main agent invocation, the metric will reflect that call's token counts rather than the agent's true ambient context.
- Adding `AmbientContext` to `TokenUsageSummary` changes `parseTokenUsageFile` from streaming aggregation to collect-then-aggregate, which increases peak memory usage proportionally to the number of log entries (though this is bounded by the `1 MB` scanner buffer and is not expected to be significant in practice).

#### Neutral
- `token_usage.go` now imports `time` from the standard library for timestamp parsing.
- The `parseTokenUsageFile` function's internal processing order changed (collect all entries, then aggregate), but the functional output for existing aggregate fields (`TotalInputTokens`, `CacheEfficiency`, etc.) is unchanged.
- Reference documentation for `audit` and `logs` commands was updated to describe the new field.

---

## Part 2 — Normative Specification (RFC 2119)

> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).

### Ambient Context Extraction

1. Implementations **MUST** compute ambient context metrics from the single earliest (chronologically first) token usage entry in `token-usage.jsonl`.
2. Implementations **MUST** sort entries by the `timestamp` field using RFC 3339 Nano format first, falling back to RFC 3339 format, when timestamps are present.
3. Implementations **MUST** use file-insertion order (entry index) as a stable tiebreaker when two entries share a timestamp or when one or both entries lack a timestamp.
4. Implementations **MUST NOT** include token counts from any invocation other than the first sorted entry in the `AmbientContextMetrics` calculation.
5. Implementations **MUST** set `effective_tokens` to `input_tokens + cache_read_tokens` for the ambient context metric.
6. Implementations **SHOULD** return `nil` and omit the field when no token usage entries are available, rather than emitting a zero-value struct.

### Output Schema

1. Implementations **MUST** expose the `ambient_context` field as an optional (`omitempty`) JSON object in the `MetricsData` struct used by `audit` JSON output.
2. Implementations **MUST** expose the `ambient_context` field as an optional (`omitempty`) JSON field on each `RunData` entry in `logs` JSON output.
3. Implementations **MUST NOT** render `ambient_context` in console-formatted tabular output; the field **MUST** carry a `console:"-"` tag.
4. Implementations **MAY** expose `ambient_context` in future report sections (e.g., audit diff, multi-run trend analysis) as the metric matures.

### Conformance

An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.

---

*ADR created by [adr-writer agent]. Review and finalize before changing status from Draft to Accepted.*
7 changes: 7 additions & 0 deletions docs/src/content/docs/reference/audit.md
@@ -51,6 +51,11 @@ gh aw audit 1234567890 --repo owner/repo

**Report sections** (rendered in Markdown or JSON): Overview, Comparison, Task/Domain, Behavior Fingerprint, Agentic Assessments, Metrics, Key Findings, Recommendations, Observability Insights, Performance Metrics, Engine Config, Prompt Analysis, Session Analysis, Safe Output Summary, MCP Server Health, Jobs, Downloaded Files, Missing Tools, Missing Data, Noops, MCP Failures, Firewall Analysis, Policy Analysis, Redacted Domains, Errors, Warnings, Tool Usage, MCP Tool Usage, Created Items.

The Metrics section includes an `ambient_context` object when available. Ambient context captures the token footprint of the first LLM invocation in the run:
- `ambient_context.input_tokens` — input tokens for the first invocation
- `ambient_context.cached_tokens` — cache-read tokens reused by the first invocation
- `ambient_context.effective_tokens` — `input_tokens + cached_tokens`
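With illustrative values (the numbers below are hypothetical), the object looks like:

```json
"ambient_context": {
  "input_tokens": 1200,
  "cached_tokens": 300,
  "effective_tokens": 1500
}
```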

## `gh aw audit diff <base-run-id> <comparison-run-id> [<comparison-run-id>...]`

Compare behavior between workflow runs. Detects policy regressions, new unauthorized domains, behavioral drift, and changes in MCP tool usage or run metrics.
@@ -118,6 +123,8 @@ This feature is built into the `gh aw logs` command via the `--format` flag.

The report output includes an executive summary, domain inventory, metrics trends, MCP server health, and per-run breakdown. It detects cross-run anomalies such as domain access spikes, elevated MCP error rates, and connection rate changes.

For each run in detailed logs JSON output, an `ambient_context` object is included when token usage data is available. It reflects only the first LLM invocation in the run (`input_tokens`, `cached_tokens`, `effective_tokens`).

**Examples:**

```bash
37 changes: 37 additions & 0 deletions pkg/cli/audit_ambient_context_test.go
@@ -0,0 +1,37 @@
//go:build !integration

package cli

import (
	"testing"
	"time"

	"github.com/github/gh-aw/pkg/workflow"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestBuildAuditDataIncludesAmbientContext(t *testing.T) {
	processedRun := ProcessedRun{
		Run: WorkflowRun{
			DatabaseID:   1,
			WorkflowName: "test",
			Status:       "completed",
			Conclusion:   "success",
			CreatedAt:    time.Now(),
		},
		TokenUsage: &TokenUsageSummary{
			AmbientContext: &AmbientContextMetrics{
				InputTokens:     1200,
				CachedTokens:    300,
				EffectiveTokens: 1500,
			},
		},
	}

	auditData := buildAuditData(processedRun, workflow.LogMetrics{}, nil)
	require.NotNil(t, auditData.Metrics.AmbientContext, "ambient context should be populated")
	assert.Equal(t, 1200, auditData.Metrics.AmbientContext.InputTokens, "input tokens should match")
	assert.Equal(t, 300, auditData.Metrics.AmbientContext.CachedTokens, "cached tokens should match")
	assert.Equal(t, 1500, auditData.Metrics.AmbientContext.EffectiveTokens, "effective tokens should match")
}
18 changes: 11 additions & 7 deletions pkg/cli/audit_report.go
@@ -98,13 +98,14 @@ type OverviewData struct {

// MetricsData contains execution metrics
type MetricsData struct {
	TokenUsage      int     `json:"token_usage,omitempty" console:"header:Token Usage,format:number,omitempty"`
	EffectiveTokens int     `json:"effective_tokens,omitempty" console:"header:Effective Tokens,format:number,omitempty"`
	EstimatedCost   float64 `json:"estimated_cost,omitempty" console:"header:Estimated Cost,format:cost,omitempty"`
	ActionMinutes   float64 `json:"action_minutes,omitempty" console:"header:Action Minutes,omitempty"`
	Turns           int     `json:"turns,omitempty" console:"header:Turns,omitempty"`
	ErrorCount      int     `json:"error_count" console:"header:Errors"`
	WarningCount    int     `json:"warning_count" console:"header:Warnings"`
	TokenUsage      int                    `json:"token_usage,omitempty" console:"header:Token Usage,format:number,omitempty"`
	EffectiveTokens int                    `json:"effective_tokens,omitempty" console:"header:Effective Tokens,format:number,omitempty"`
	AmbientContext  *AmbientContextMetrics `json:"ambient_context,omitempty" console:"title:Ambient Context,omitempty"`
	EstimatedCost   float64                `json:"estimated_cost,omitempty" console:"header:Estimated Cost,format:cost,omitempty"`
	ActionMinutes   float64                `json:"action_minutes,omitempty" console:"header:Action Minutes,omitempty"`
	Turns           int                    `json:"turns,omitempty" console:"header:Turns,omitempty"`
	ErrorCount      int                    `json:"error_count" console:"header:Errors"`
	WarningCount    int                    `json:"warning_count" console:"header:Warnings"`
}

// JobData contains information about individual jobs
@@ -285,6 +286,9 @@ func buildAuditData(processedRun ProcessedRun, metrics LogMetrics, mcpToolUsage
	} else if run.EffectiveTokens > 0 {
		metricsData.EffectiveTokens = run.EffectiveTokens
	}
	if processedRun.TokenUsage != nil && processedRun.TokenUsage.AmbientContext != nil {
		metricsData.AmbientContext = processedRun.TokenUsage.AmbientContext
	}

	// Populate ActionMinutes from run duration so it is always visible even
	// when token/turn metrics are zero (e.g. Codex runs that exit early).
42 changes: 42 additions & 0 deletions pkg/cli/logs_ambient_context_test.go
@@ -0,0 +1,42 @@
//go:build !integration

package cli

import (
	"testing"
	"time"

	"github.com/github/gh-aw/pkg/testutil"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestBuildLogsDataIncludesAmbientContext(t *testing.T) {
	tmpDir := testutil.TempDir(t, "logs-ambient-context")
	processedRuns := []ProcessedRun{
		{
			Run: WorkflowRun{
				DatabaseID:   1,
				WorkflowName: "test",
				Status:       "completed",
				Conclusion:   "success",
				CreatedAt:    time.Now(),
				LogsPath:     tmpDir,
			},
			TokenUsage: &TokenUsageSummary{
				AmbientContext: &AmbientContextMetrics{
					InputTokens:     800,
					CachedTokens:    200,
					EffectiveTokens: 1000,
				},
			},
		},
	}

	data := buildLogsData(processedRuns, tmpDir, nil)
	require.Len(t, data.Runs, 1, "should produce a single run")
	require.NotNil(t, data.Runs[0].AmbientContext, "ambient context should be included")
	assert.Equal(t, 800, data.Runs[0].AmbientContext.InputTokens, "input tokens should match")
	assert.Equal(t, 200, data.Runs[0].AmbientContext.CachedTokens, "cached tokens should match")
	assert.Equal(t, 1000, data.Runs[0].AmbientContext.EffectiveTokens, "effective tokens should match")
}