SRE acceptance criteria review for evm-dump + RPC error context#445
…context

SRE_REPORT.md documents production-readiness gaps (health checks, graceful shutdown, retry policies, metrics, timeouts). PLAN.md provides a prioritized remediation roadmap. The RPC client now attaches rpcUrl, rpcMethod, and durationMs to all errors that escape the retry loop, so fatal log lines carry enough context for on-call triage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR adds an SRE production-readiness audit of evm-dump and begins remediation by enriching RPC errors with structured context (rpcUrl, rpcMethod, durationMs) via a new enrichError() helper in RpcClient.
Changes:
- New `enrichError()` private method in `RpcClient` that attaches URL, method, and duration context to errors escaping the retry loop
- New `SRE_REPORT.md` scoring the service against 7 SRE criteria with detailed gap analysis
- New `PLAN.md` with a prioritized (P0–P3) remediation roadmap across 14 tasks
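To illustrate the kind of enrichment described above, here is a minimal sketch. It is not the project's `enrichError()` or `addErrorContext` implementation: the names `RpcError` and `withRpcContext` are hypothetical, and the choice to fill only missing fields is an assumption of this sketch, not confirmed behavior.

```typescript
// Hypothetical types and helper for illustration only.
interface RpcError extends Error {
    rpcUrl?: string
    rpcMethod?: string
    durationMs?: number
}

interface RpcErrorContext {
    rpcUrl: string
    rpcMethod: string
    durationMs: number
}

function withRpcContext(err: unknown, ctx: RpcErrorContext): RpcError {
    // Non-Error throwables (strings, plain objects) are wrapped so the result
    // is always an Error with a stack trace.
    const e: RpcError = err instanceof Error ? err : new Error(String(err))
    // Assumption of this sketch: fill only missing fields, so context attached
    // closer to the failure is never overwritten by an outer layer.
    if (e.rpcUrl == null) e.rpcUrl = ctx.rpcUrl
    if (e.rpcMethod == null) e.rpcMethod = ctx.rpcMethod
    if (e.durationMs == null) e.durationMs = ctx.durationMs
    return e
}
```

Because the fields live on the error object itself, a structured logger that serializes error properties can emit them without any change at the call sites.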
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| util/rpc-client/src/client.ts | Adds enrichError() and applies it at all error-rejection sites |
| evm/evm-dump/SRE_REPORT.md | Full audit report scoring each SRE criterion |
| evm/evm-dump/PLAN.md | Prioritized remediation plan derived from SRE_REPORT.md |
| .claude/skills/product-acceptance-criteria | Symlink to shared agent skill definition |
| .agents/skills/product-acceptance-criteria/SKILL.md | SRE checklist skill definition for AI agents |
| @@ -402,19 +402,20 @@ export class RpcClient { | |||
| } | |||
| req.resolve(result) | |||
| }, err => { | |||
The durationMs computation sits inside the error callback, so it is evaluated only when a request fails. The startTime variable (presumably captured when the HTTP request is dispatched) is not visible in the diff. If startTime is reset on each dispatch attempt, the duration computed here covers only the last attempt, not the total time accumulated across all retries. That value is correct for the attempt but misleading if read as a total, and for an on-call engineer the most useful figure is the total time spent on the operation. Either add a comment stating whether durationMs is per-attempt or cumulative, or make it unambiguously per-attempt by capturing startTime inside the dispatch closure rather than in shared state.
| }, err => { | |
| }, err => { | |
| // Duration of this individual send attempt only (not cumulative across retries). |
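The per-attempt versus cumulative distinction can be made explicit by recording both values. The sketch below is an assumption-laden illustration, not code from `util/rpc-client`: `callWithRetries`, `AttemptError`, and the `totalDurationMs` and `attempts` fields are hypothetical names, and the loop is synchronous with an injectable clock only to keep the example deterministic.

```typescript
// Hypothetical sketch; none of these names come from the PR.
type Clock = () => number

interface AttemptError extends Error {
    durationMs?: number       // time spent in the failing attempt only
    totalDurationMs?: number  // time since the first attempt started
    attempts?: number         // number of attempts made so far
}

function callWithRetries<T>(attempt: () => T, maxAttempts: number, now: Clock = Date.now): T {
    const operationStart = now()
    let lastError: AttemptError | undefined
    for (let i = 1; i <= maxAttempts; i++) {
        // Captured inside the loop, so it can never be shared between attempts.
        const attemptStart = now()
        try {
            return attempt()
        } catch (e) {
            const end = now()
            const err: AttemptError = e instanceof Error ? e : new Error(String(e))
            err.durationMs = end - attemptStart
            err.totalDurationMs = end - operationStart
            err.attempts = i
            lastError = err
        }
    }
    throw lastError ?? new Error('maxAttempts must be at least 1')
}
```

With both fields present, a log reader does not have to guess which interpretation applies: `durationMs` points at a slow endpoint, while `totalDurationMs` shows how long the caller was blocked.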
util/rpc-client/src/client.ts
Outdated
| let call = Array.isArray(req.call) ? req.call[0] : req.call | ||
| return addErrorContext(err, { | ||
| rpcUrl: this.url, | ||
| rpcMethod: call.method, |
When req.call is a batch (array), only the first call's method is attached as rpcMethod. For a batch request, a single element's method is not representative of what failed — the entire batch may have failed, or only some calls within it. The enriched error will mislead triage by suggesting only one method was involved. Consider either joining all method names (e.g., req.call.map(c => c.method).join(',')) or adding a separate rpcBatchSize field alongside rpcMethod to make the batch nature explicit.
| let call = Array.isArray(req.call) ? req.call[0] : req.call | |
| return addErrorContext(err, { | |
| rpcUrl: this.url, | |
| rpcMethod: call.method, | |
| let rpcMethod: string | undefined | |
| let rpcBatchSize: number | undefined | |
| if (Array.isArray(req.call)) { | |
| rpcBatchSize = req.call.length | |
| rpcMethod = req.call | |
| .map(c => c.method) | |
| .filter(m => m != null) | |
| .join(',') | |
| } else { | |
| rpcMethod = req.call.method | |
| } | |
| return addErrorContext(err, { | |
| rpcUrl: this.url, | |
| rpcMethod, | |
| ...(rpcBatchSize != null ? {rpcBatchSize} : {}), |
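One possible refinement of the suggestion above, sketched under assumptions: a batch of several hundred calls to the same method would produce a very long `rpcMethod` string if every name is joined, so this variant lists each distinct method once and relies on `rpcBatchSize` for the count. `RpcCall`, `CallSummary`, and `summarizeCall` are hypothetical names; the real request types live in `util/rpc-client` and may differ.

```typescript
// Hypothetical shapes for illustration only.
interface RpcCall {
    method: string
}

interface CallSummary {
    rpcMethod: string
    rpcBatchSize?: number
}

function summarizeCall(call: RpcCall | RpcCall[]): CallSummary {
    if (!Array.isArray(call)) {
        return {rpcMethod: call.method}
    }
    // Keep the first occurrence of each method name, preserving order,
    // so uniform batches yield a short, stable label.
    const methods = call
        .map(c => c.method)
        .filter((m, i, all) => all.indexOf(m) === i)
    return {rpcMethod: methods.join(','), rpcBatchSize: call.length}
}
```

Whether to deduplicate is a judgment call: the full list preserves call order for debugging, while the deduplicated form keeps log lines bounded.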
- Add comment clarifying durationMs is per-attempt, not cumulative
- For batch calls, join all method names and include rpcBatchSize

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- SRE audit of `evm-dump` against 7 SRE criteria (logging, metrics, health checks, graceful shutdown, retries, circuit breakers, timeouts). Scores range from 0/10 to 4/10.
- `RpcClient.enrichError()` attaches `rpcUrl`, `rpcMethod`, and `durationMs` to all errors that escape the retry loop via the existing `addErrorContext` utility. Fatal log lines now carry enough context for on-call triage without any changes to call sites.

Key findings (from SRE_REPORT.md)
- No `/healthz` or `/readyz` endpoints
- `retryAttempts: Number.MAX_SAFE_INTEGER` (hangs forever)

Test plan
- `tsc --noEmit` passes for `util/rpc-client` (only pre-existing `removeArrayItem` error)

🤖 Generated with Claude Code