refactor(tracker): library API hardening for v1.0#113
refactor(tracker): library API hardening for v1.0#113clintecker wants to merge 2 commits intomainfrom
Conversation
Batches six related pre-1.0 fixes flagged in the expert panel review of PR #101 (#102, #103, #104, #105, #106, #109). All changes touch the v0.18.0 library surface and are breaking, so ship together: - Thread context.Context through Doctor, Diagnose, DiagnoseMostRecent, Audit, and Simulate. Provider probes and binary version lookups honor ctx; getBinaryVersion now uses exec.CommandContext with a 5-second timeout to match getDippinVersion. - Introduce typed CheckStatus and SuggestionKind so consumers can switch-exhaust. Constant values are unchanged. - NDJSONWriter.Write now returns error. First write failure is still logged once to stderr; subsequent failures surface via return value. - Rename NDJSONEvent -> StreamEvent (wire format unchanged). - Move DoctorConfig.TrackerVersion/TrackerCommit off the config struct behind a WithVersionInfo functional option. - Add LogWriter to DoctorConfig/DiagnoseConfig/AuditConfig. CLI sets it to io.Discard so users stop seeing stray library warnings from audit/list/diagnose on os.Stderr. - sanitizeProviderError strips API keys and bearer tokens from provider error bodies before they land in CheckDetail.Message. - Panic-recover the NDJSON handler closures (pipeline, agent, LLM trace) so a misbehaving writer cannot crash the caller goroutine. - Stream activity.jsonl in Diagnose via bufio.Scanner (1 MB buffer limit), matching LoadActivityLog and avoiding memory spikes on large runs. - Rename internal *Lib/*NF/*ForDiagnose helpers to plain names. - CHANGELOG notes deterministic diagnose suggestion ordering (already alphabetical post-#101, documented as a visible change). Closes #102, #103, #104, #105, #106, #109. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WalkthroughThis PR applies coordinated API and implementation changes across the tracker package: adds context parameters to public entry points, introduces typed enums for statuses/kinds, renames NDJSON/wire types and makes writer writes return errors, replaces DoctorConfig version fields with a functional option, streams activity parsing, adds configurable LogWriter for warnings, and sanitizes provider error messages. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b333d15fa8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ctx.Err() != nil { | ||
| break | ||
| } |
There was a problem hiding this comment.
Return cancellation error when Diagnose context is done
Breaking out of the scan loop on ctx.Err() returns a partial DiagnoseReport with nil error, so callers using deadlines/cancellation cannot tell the analysis was truncated. This makes downstream automation treat incomplete diagnostics as authoritative whenever the context is canceled mid-parse. Diagnose should propagate ctx.Err() (or wrap it) instead of silently succeeding with partial results.
Useful? React with 👍 / 👎.
| applyRetryAnalysis(failures, failSignatures) | ||
| return halt |
There was a problem hiding this comment.
Handle scanner errors before returning Diagnose activity data
enrichFromActivity never checks scanner.Err(), so read failures (including bufio.ErrTooLong once a JSONL line exceeds the 1 MB buffer) are silently ignored and the function returns incomplete failure/retry/budget analysis as if successful. Because this path replaced full-file parsing, large or corrupted activity.jsonl inputs now degrade results without any signal.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tracker_diagnose.go (1)
204-243:⚠️ Potential issue | 🟡 MinorScanner error is not checked after the loop.
The
bufio.Scannercan fail for reasons other than EOF (e.g., line exceeds buffer size despite the 1MB limit, or I/O error). The code correctly sets up the scanner with a larger buffer but doesn't checkscanner.Err()after the loop exits. WhileenrichFromActivityreturns only theBudgetHaltstruct (not an error), silently ignoring scanner errors could mask truncated reads.Consider logging scanner errors to
logWif one is available in scope, or at minimum documenting that parse errors are ignored.🔧 Proposed fix to log scanner errors
for scanner.Scan() { if ctx.Err() != nil { break } line := strings.TrimSpace(scanner.Text()) if line == "" { continue } var entry diagnoseEntry if err := json.Unmarshal([]byte(line), &entry); err != nil { continue } if entry.Type == "budget_exceeded" { halt = &BudgetHalt{ TotalTokens: entry.TotalTokens, TotalCostUSD: entry.TotalCostUSD, WallElapsedMs: entry.WallElapsedMs, Message: entry.Message, } } enrichFromEntry(entry, failures, stageStarts, failSignatures) } + // Scanner errors (I/O, buffer overflow) are non-fatal for diagnosis; + // partial data is still useful. Could log to a writer if needed. + _ = scanner.Err() applyRetryAnalysis(failures, failSignatures) return halt }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tracker_diagnose.go` around lines 204 - 243, enrichFromActivity uses bufio.Scanner but never checks scanner.Err() after the Scan loop, which can silently drop I/O or buffer errors; after the for scanner.Scan() loop in enrichFromActivity, call err := scanner.Err() and if err != nil log the error (prefer using existing logger variable logW if in scope, otherwise the package logger or fmt) including context like runDir and the error; do not change the function signature—just add the post-loop check and a descriptive log line referencing scanner, runDir, and the error.
🧹 Nitpick comments (1)
tracker_events.go (1)
148-156: Package-levelndjsonPanicOncesuppresses panics across all writer instances.The
ndjsonPanicOnceis a package-levelsync.Once, meaning that if oneNDJSONWriterinstance recovers from a panic, all subsequent panics from anyNDJSONWriterinstance in the process will be silently suppressed without logging. This could mask unrelated issues when multiple writers are used (e.g., different output streams for different pipeline runs).Consider making this per-instance on
NDJSONWriter:Suggested change
type NDJSONWriter struct { mu sync.Mutex w io.Writer errOnce sync.Once + panicOnce sync.Once } -var ndjsonPanicOnce sync.Once - -func recoverNDJSONPanic(source string) { +func (s *NDJSONWriter) recoverPanic(source string) { if r := recover(); r != nil { - ndjsonPanicOnce.Do(func() { + s.panicOnce.Do(func() { fmt.Fprintf(os.Stderr, "tracker: NDJSON %s handler recovered from panic: %v (further panics suppressed)\n", source, r) }) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tracker_events.go` around lines 148 - 156, The package-level ndjsonPanicOnce causes panic-suppression across all NDJSONWriter instances; change this to a per-instance sync.Once field on NDJSONWriter (e.g., add a field like panicOnce sync.Once), convert recoverNDJSONPanic into a method on *NDJSONWriter that calls the instance's panicOnce.Do and prints the same message (or accept the source string), and update all call sites that currently call recoverNDJSONPanic(...) to call the new method (writer.recoverNDJSONPanic(...)). This ensures each NDJSONWriter tracks its own suppression state and prevents unrelated writers from being muted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@README.md`:
- Line 591: The README line incorrectly implies all four APIs accept optional
config structs with LogWriter; update the wording to state that tracker.Audit,
tracker.Doctor and the Diagnose API (AuditConfig, DoctorConfig, DiagnoseConfig)
accept an optional config struct with a LogWriter for non‑fatal parse warnings
(set to io.Discard to silence), while tracker.Simulate does not use those config
structs and does not accept a LogWriter—adjust the sentence to explicitly
exclude tracker.Simulate and name the exact structs (AuditConfig,
DiagnoseConfig, DoctorConfig) and LogWriter to avoid confusion.
In `@tracker_audit.go`:
- Around line 85-89: The nil-check in Audit(ctx context.Context, runDir string,
opts ...AuditConfig) is ineffectual because ctx is reassigned but never used; to
silence the linter and document intent, keep the nil-check but add a deliberate
no-op usage like `_ = ctx` after the check (or alternatively remove the
nil-check entirely if you prefer), referencing the Audit function and the
existing call to firstAuditConfig(opts) so reviewers can find the spot to
modify.
In `@tracker_doctor.go`:
- Around line 41-50: DoctorConfig.LogWriter is declared but never used; add a
helper similar to logWriterOrDiscard() and thread its result through the Doctor
flow and any check* functions that emit non-fatal parse warnings so those checks
write to LogWriter instead of being silent. Specifically, implement a
logWriterOrDiscard() helper (as used in tracker_audit.go and
tracker_diagnose.go), call it at the start of Doctor (or where DoctorConfig is
consumed), pass the returned io.Writer into check functions invoked by Doctor
(e.g., any check* functions in this file), and replace direct writes to
stderr/discard with fmt.Fprintf to the provided writer; alternatively remove
DoctorConfig.LogWriter if you opt not to surface warnings.
---
Outside diff comments:
In `@tracker_diagnose.go`:
- Around line 204-243: enrichFromActivity uses bufio.Scanner but never checks
scanner.Err() after the Scan loop, which can silently drop I/O or buffer errors;
after the for scanner.Scan() loop in enrichFromActivity, call err :=
scanner.Err() and if err != nil log the error (prefer using existing logger
variable logW if in scope, otherwise the package logger or fmt) including
context like runDir and the error; do not change the function signature—just add
the post-loop check and a descriptive log line referencing scanner, runDir, and
the error.
---
Nitpick comments:
In `@tracker_events.go`:
- Around line 148-156: The package-level ndjsonPanicOnce causes
panic-suppression across all NDJSONWriter instances; change this to a
per-instance sync.Once field on NDJSONWriter (e.g., add a field like panicOnce
sync.Once), convert recoverNDJSONPanic into a method on *NDJSONWriter that calls
the instance's panicOnce.Do and prints the same message (or accept the source
string), and update all call sites that currently call recoverNDJSONPanic(...)
to call the new method (writer.recoverNDJSONPanic(...)). This ensures each
NDJSONWriter tracks its own suppression state and prevents unrelated writers
from being muted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 69bf4bea-3dee-4100-be47-c0fc75718dcb
📒 Files selected for processing (20)
.gitignoreCHANGELOG.mdREADME.mdcmd/tracker/audit.gocmd/tracker/diagnose.gocmd/tracker/doctor.gocmd/tracker/simulate.gotracker_activity.gotracker_audit.gotracker_audit_test.gotracker_diagnose.gotracker_diagnose_test.gotracker_doctor.gotracker_doctor_test.gotracker_doctor_unix.gotracker_doctor_windows.gotracker_events.gotracker_events_test.gotracker_simulate.gotracker_simulate_test.go
- Diagnose: propagate ctx.Err() instead of silently truncating the report on cancellation. Partial reports are never returned as success, so deadline-aware callers can distinguish complete from truncated output. (Codex P1) - enrichFromActivity: surface bufio.Scanner errors (1 MB line overflow, I/O) instead of dropping them on the floor. Returns the error up to Diagnose and logs it via the LogWriter. (Codex P2, CodeRabbit) - NDJSONWriter: panicOnce is now a per-instance field, not a package-level sync.Once. Before this, one misbehaving sink would silence panic logging across every NDJSONWriter in the process. Added a test to verify per-instance isolation. (CodeRabbit) - Drop unused DoctorConfig.LogWriter. Doctor doesn't emit stderr warnings — the field was added symmetrically with Audit/Diagnose but had no consumers. (CodeRabbit) - Drop ineffectual ctx nil-check in Audit and Simulate. ctx is reserved for future extensibility; silence the linter with `_ = ctx` and drop the dead nil-guard. (CodeRabbit, golangci-lint) - README: correct the library-API sentence — Simulate does not take a config struct; Doctor uses functional options, not AuditConfig-shaped options. (CodeRabbit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tracker_simulate.go (1)
57-58: Normalize nilctxnow to lock in API behavior.
Small forward-compat improvement: coalesce nil tocontext.Background()instead of using_ = ctx, so future cancellation plumbing won’t accidentally break nil-callers.Suggested diff
func Simulate(ctx context.Context, source string) (*SimulateReport, error) { - _ = ctx // reserved for future cancellation plumbing + if ctx == nil { + ctx = context.Background() + } format := detectSourceFormat(source) graph, err := parsePipelineSource(source, format)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tracker_simulate.go` around lines 57 - 58, The Simulate function currently ignores the incoming ctx (using "_ = ctx"), which can cause future cancellation plumbing to break for nil callers; update Simulate (function Simulate in tracker_simulate.go) to coalesce a nil ctx to context.Background() by checking if ctx == nil and assigning context.Background() before using ctx so future cancellation logic works correctly.tracker_audit.go (1)
17-24: ClarifyAuditConfigdoc scope to avoid caller ambiguity.The comment says this config applies to
Audit()andListRuns(), butLogWriteris currently only consumed inListRuns(). Consider tightening wording so callers don’t expectAudit()warning redirection yet.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tracker_audit.go` around lines 17 - 24, The doc comment for AuditConfig is misleading: it claims to configure Audit() and ListRuns(), but the LogWriter field is only used by ListRuns(); either update Audit() to honor LogWriter or narrow the comment. Change the AuditConfig top-level comment to state it configures ListRuns() (and/or other functions that actually consume it) and update the LogWriter field comment to explicitly say "LogWriter is consumed by ListRuns() only; nil is treated as io.Discard..." so callers won't expect Audit() to redirect warnings unless you also modify the Audit() implementation to use AuditConfig.LogWriter.tracker_events_test.go (1)
392-410: Assert the per-instance stderr side effect here.This only proves both handlers recover. It does not verify the regression called out in the test name: if panic suppression becomes package-scoped again, this still passes because nothing observes stderr. Capture stderr and assert that each writer emits its own first panic log line.
Possible tightening
func TestNDJSONWriter_PanicSuppressionIsPerInstance(t *testing.T) { + oldStderr := os.Stderr + r, wpipe, err := os.Pipe() + if err != nil { + t.Fatalf("pipe: %v", err) + } + os.Stderr = wpipe + defer func() { + os.Stderr = oldStderr + _ = r.Close() + }() + defer func() { if r := recover(); r != nil { t.Fatalf("handlers should recover, got: %v", r) } }() w1 := NewNDJSONWriter(&panicWriter{}) w2 := NewNDJSONWriter(&panicWriter{}) // First panic on w1 — must not consume w2's Once. w1.PipelineHandler().HandlePipelineEvent(pipeline.PipelineEvent{ Type: pipeline.EventPipelineStarted, Timestamp: time.Now(), }) // Second panic on w2 — still the first on its own instance. w2.PipelineHandler().HandlePipelineEvent(pipeline.PipelineEvent{ Type: pipeline.EventPipelineStarted, Timestamp: time.Now(), }) + + _ = wpipe.Close() + out, err := io.ReadAll(r) + if err != nil { + t.Fatalf("read stderr: %v", err) + } + if got := len(strings.Split(strings.TrimSpace(string(out)), "\n")); got != 2 { + t.Fatalf("got %d panic log lines, want 2", got) + } }Also add
ioandosimports.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tracker_events_test.go` around lines 392 - 410, The test TestNDJSONWriter_PanicSuppressionIsPerInstance must capture and assert stderr output so it verifies per-instance panic logging: wrap the calls that trigger panics (use NewNDJSONWriter(&panicWriter{}) and call w1.PipelineHandler().HandlePipelineEvent(...) and w2.PipelineHandler().HandlePipelineEvent(...)) while redirecting os.Stderr to a pipe (using io and os imports), read the captured output, and assert that there are two separate first-panic log lines (one for w1 and one for w2) indicating each instance emitted its own panic message; restore stderr after capture. Ensure assertions reference the instances w1 and w2 and that the captured output contains distinct panic entries for each.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tracker_audit.go`:
- Around line 17-24: The doc comment for AuditConfig is misleading: it claims to
configure Audit() and ListRuns(), but the LogWriter field is only used by
ListRuns(); either update Audit() to honor LogWriter or narrow the comment.
Change the AuditConfig top-level comment to state it configures ListRuns()
(and/or other functions that actually consume it) and update the LogWriter field
comment to explicitly say "LogWriter is consumed by ListRuns() only; nil is
treated as io.Discard..." so callers won't expect Audit() to redirect warnings
unless you also modify the Audit() implementation to use AuditConfig.LogWriter.
In `@tracker_events_test.go`:
- Around line 392-410: The test TestNDJSONWriter_PanicSuppressionIsPerInstance
must capture and assert stderr output so it verifies per-instance panic logging:
wrap the calls that trigger panics (use NewNDJSONWriter(&panicWriter{}) and call
w1.PipelineHandler().HandlePipelineEvent(...) and
w2.PipelineHandler().HandlePipelineEvent(...)) while redirecting os.Stderr to a
pipe (using io and os imports), read the captured output, and assert that there
are two separate first-panic log lines (one for w1 and one for w2) indicating
each instance emitted its own panic message; restore stderr after capture.
Ensure assertions reference the instances w1 and w2 and that the captured output
contains distinct panic entries for each.
In `@tracker_simulate.go`:
- Around line 57-58: The Simulate function currently ignores the incoming ctx
(using "_ = ctx"), which can cause future cancellation plumbing to break for nil
callers; update Simulate (function Simulate in tracker_simulate.go) to coalesce
a nil ctx to context.Background() by checking if ctx == nil and assigning
context.Background() before using ctx so future cancellation logic works
correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72db3a01-f4bf-4ac1-95c4-f99b3a5f1fdd
📒 Files selected for processing (10)
CHANGELOG.mdREADME.mdcmd/tracker/doctor.gotracker_audit.gotracker_diagnose.gotracker_diagnose_test.gotracker_doctor.gotracker_events.gotracker_events_test.gotracker_simulate.go
✅ Files skipped from review due to trivial changes (1)
- README.md
🚧 Files skipped from review as they are similar to previous changes (5)
- cmd/tracker/doctor.go
- tracker_diagnose_test.go
- CHANGELOG.md
- tracker_diagnose.go
- tracker_events.go
Summary
Batches six pre-1.0 library-API issues flagged in the expert panel review of PR #101 into one breaking change. All of them touch the new v0.18.0 library surface (
tracker.Doctor,tracker.Diagnose,tracker.Audit,tracker.Simulate,NDJSONWriter), so bundling them means one semver bump instead of six.Closes #102, #103, #104, #105, #106, #109.
What's in
context.Contextis now the first argument toDoctor,Diagnose,DiagnoseMostRecent,Audit,Simulate. A nil context is treated ascontext.Background().CheckStatusandSuggestionKindare now typed strings (enum-like). Existing constant values unchanged.NDJSONWriter.Writenow returnserror. First failure still logs once toos.Stderr; subsequent failures surface via the return value.NDJSONEvent→StreamEvent. Wire-format JSON tags unchanged.DoctorConfig.TrackerVersion/TrackerCommitremoved; usetracker.WithVersionInfo(version, commit)functional option instead.*Lib/*NF/*ForDiagnosesuffixes removed from package-private helpers (checkEnvWarnings,checkProviders,collectNodeFailures,enrichFromEntry,buildRunSummary, etc.).Diagnosenow streamsactivity.jsonlwithbufio.Scanner(1 MB buffer) instead ofos.ReadFile+strings.Split— matchesLoadActivityLogand avoids a memory spike on large runs.getBinaryVersionnow usesexec.CommandContextwith a 5-second timeout. NDJSONWriter backpressure behavior is now documented on the type.LogWriter io.Writerfield toDoctorConfig,DiagnoseConfig,AuditConfig. Nil →io.Discard. CLI sets it toio.Discardforaudit/list/diagnoseso users no longer see straywarning: run X: cannot read activity logon stderr.sanitizeProviderErrorstrips API keys (sk-ant-, sk-, AIza*) and bearer tokens from provider error text before they land inCheckDetail.Message(which consumers may log or forward to webhooks).NDJSONWriter's three handler closures (pipeline, agent, LLM trace) nowrecover()from panics in the underlying writer and log a single line to stderr. A misbehaving sink cannot crash the caller goroutine.Test plan
go build ./...passesgo test ./... -shortpasses (all 17 packages)go test ./...full suite passesgo test -race -run TestNDJSONpassesgo vet ./...cleandippin doctoron the three core pipelines → A / 100tracker doctor,tracker simulate ...,tracker liststill produce correct outputsanitizeProviderError(5 cases),NDJSONWriter.Writeerror propagation, panic recovery in all three handler factories,LogWriterparameter redirecting warnings inListRunsBreaking changes (library only)
Callers of the following signatures need updates:
Typed
CheckStatus/SuggestionKindare source-compatible with untyped string-literal comparisons (status == "ok"still works).🤖 Generated with Claude Code
Summary by CodeRabbit
Breaking Changes
New Features
Performance
Bug Fixes