Skip to content

SRE acceptance criteria review for evm-dump + RPC error context#445

Open
mo4islona wants to merge 2 commits intoopen-betafrom
sre/evm-dump-acceptance-criteria
Open

SRE acceptance criteria review for evm-dump + RPC error context#445
mo4islona wants to merge 2 commits intoopen-betafrom
sre/evm-dump-acceptance-criteria

Conversation

@mo4islona
Copy link
Contributor

Summary

  • SRE_REPORT.md — Full production-readiness audit of evm-dump against 7 SRE criteria (logging, metrics, health checks, graceful shutdown, retries, circuit breakers, timeouts). Scores range from 0/10 to 4/10.
  • PLAN.md — Prioritized remediation roadmap (P0–P3) with 14 actionable tasks, file-level guidance, and complexity estimates.
  • RPC error context enrichmentRpcClient.enrichError() attaches rpcUrl, rpcMethod, and durationMs to all errors that escape the retry loop via the existing addErrorContext utility. Fatal log lines now carry enough context for on-call triage without any changes to call sites.

Key findings (from SRE_REPORT.md)

Area Score Top issue
Graceful Shutdown 0/10 No SIGTERM handler — corrupted S3 objects on pod kill
Health Checks 0/10 No /healthz or /readyz endpoints
Retry Policies 4/10 retryAttempts: Number.MAX_SAFE_INTEGER — hangs forever
Metrics 4/10 RPC duration is averaged gauge, not histogram
Timeouts 3/10 RPC timeout hardcoded at 180s, S3 has none

Test plan

  • tsc --noEmit passes for util/rpc-client (only pre-existing removeArrayItem error)
  • Verify enriched error fields appear in structured logs during manual RPC failure test
  • Review PLAN.md priorities with team before starting Phase 1

🤖 Generated with Claude Code

…context

SRE_REPORT.md documents production-readiness gaps (health checks, graceful
shutdown, retry policies, metrics, timeouts). PLAN.md provides a prioritized
remediation roadmap. The RPC client now attaches rpcUrl, rpcMethod, and
durationMs to all errors that escape the retry loop, so fatal log lines
carry enough context for on-call triage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 4, 2026 10:38
@mo4islona mo4islona changed the base branch from master to open-beta March 4, 2026 10:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an SRE production-readiness audit of evm-dump and begins remediation by enriching RPC errors with structured context (rpcUrl, rpcMethod, durationMs) via a new enrichError() helper in RpcClient.

Changes:

  • New enrichError() private method in RpcClient that attaches URL, method, and duration context to errors escaping the retry loop
  • New SRE_REPORT.md scoring the service against 7 SRE criteria with detailed gap analysis
  • New PLAN.md with a prioritized (P0–P3) remediation roadmap across 14 tasks

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
util/rpc-client/src/client.ts Adds enrichError() and applies it at all error-rejection sites
evm/evm-dump/SRE_REPORT.md Full audit report scoring each SRE criterion
evm/evm-dump/PLAN.md Prioritized remediation plan derived from SRE_REPORT.md
.claude/skills/product-acceptance-criteria Symlink to shared agent skill definition
.agents/skills/product-acceptance-criteria/SKILL.md SRE checklist skill definition for AI agents

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -402,19 +402,20 @@ export class RpcClient {
}
req.resolve(result)
}, err => {
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The durationMs computation is placed inside the error callback, so it is only evaluated when a request fails. However, the startTime variable (presumably captured at the point the HTTP request is dispatched) is not visible in the diff. If startTime is defined outside the error callback and is shared across retries (i.e. reset on each dispatch attempt), the duration computed here will reflect the time since the last attempt was dispatched, not the total accumulated duration across all retry attempts. For an on-call engineer, the most useful figure is the total time spent on the operation. If startTime is not reset per attempt, durationMs is correct for that attempt but misleading if read as a total. This should be clearly documented with a comment indicating whether durationMs is per-attempt or cumulative, or it should be made unambiguously per-attempt by capturing startTime inside the dispatch closure rather than shared state.

Suggested change
}, err => {
}, err => {
// Duration of this individual send attempt only (not cumulative across retries).

Copilot uses AI. Check for mistakes.
Comment on lines +515 to +518
let call = Array.isArray(req.call) ? req.call[0] : req.call
return addErrorContext(err, {
rpcUrl: this.url,
rpcMethod: call.method,
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When req.call is a batch (array), only the first call's method is attached as rpcMethod. For a batch request, a single element's method is not representative of what failed — the entire batch may have failed, or only some calls within it. The enriched error will mislead triage by suggesting only one method was involved. Consider either joining all method names (e.g., req.call.map(c => c.method).join(',')) or adding a separate rpcBatchSize field alongside rpcMethod to make the batch nature explicit.

Suggested change
let call = Array.isArray(req.call) ? req.call[0] : req.call
return addErrorContext(err, {
rpcUrl: this.url,
rpcMethod: call.method,
let rpcMethod: string | undefined
let rpcBatchSize: number | undefined
if (Array.isArray(req.call)) {
rpcBatchSize = req.call.length
rpcMethod = req.call
.map(c => c.method)
.filter(m => m != null)
.join(',')
} else {
rpcMethod = req.call.method
}
return addErrorContext(err, {
rpcUrl: this.url,
rpcMethod,
...(rpcBatchSize != null ? {rpcBatchSize} : {}),

Copilot uses AI. Check for mistakes.
- Add comment clarifying durationMs is per-attempt, not cumulative
- For batch calls, join all method names and include rpcBatchSize

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants