SRE acceptance criteria review for evm-dump + RPC error context by mo4islona · Pull Request #445 · subsquid/squid-sdk

mo4islona · 2026-03-04T10:38:09Z

Summary

SRE_REPORT.md — Full production-readiness audit of evm-dump against 7 SRE criteria (logging, metrics, health checks, graceful shutdown, retries, circuit breakers, timeouts). Scores range from 0/10 to 4/10.
PLAN.md — Prioritized remediation roadmap (P0–P3) with 14 actionable tasks, file-level guidance, and complexity estimates.
RPC error context enrichment — RpcClient.enrichError() attaches rpcUrl, rpcMethod, and durationMs to all errors that escape the retry loop via the existing addErrorContext utility. Fatal log lines now carry enough context for on-call triage without any changes to call sites.

Key findings (from SRE_REPORT.md)

Area	Score	Top issue
Graceful Shutdown	0/10	No SIGTERM handler — corrupted S3 objects on pod kill
Health Checks	0/10	No `/healthz` or `/readyz` endpoints
Retry Policies	4/10	`retryAttempts: Number.MAX_SAFE_INTEGER` — hangs forever
Metrics	4/10	RPC duration is averaged gauge, not histogram
Timeouts	3/10	RPC timeout hardcoded at 180s, S3 has none

Test plan

tsc --noEmit passes for util/rpc-client (only pre-existing removeArrayItem error)
Verify enriched error fields appear in structured logs during manual RPC failure test
Review PLAN.md priorities with team before starting Phase 1

🤖 Generated with Claude Code

…context SRE_REPORT.md documents production-readiness gaps (health checks, graceful shutdown, retry policies, metrics, timeouts). PLAN.md provides a prioritized remediation roadmap. The RPC client now attaches rpcUrl, rpcMethod, and durationMs to all errors that escape the retry loop, so fatal log lines carry enough context for on-call triage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds an SRE production-readiness audit of evm-dump and begins remediation by enriching RPC errors with structured context (rpcUrl, rpcMethod, durationMs) via a new enrichError() helper in RpcClient.

Changes:

New enrichError() private method in RpcClient that attaches URL, method, and duration context to errors escaping the retry loop
New SRE_REPORT.md scoring the service against 7 SRE criteria with detailed gap analysis
New PLAN.md with a prioritized (P0–P3) remediation roadmap across 14 tasks

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`util/rpc-client/src/client.ts`	Adds `enrichError()` and applies it at all error-rejection sites
`evm/evm-dump/SRE_REPORT.md`	Full audit report scoring each SRE criterion
`evm/evm-dump/PLAN.md`	Prioritized remediation plan derived from SRE_REPORT.md
`.claude/skills/product-acceptance-criteria`	Symlink to shared agent skill definition
`.agents/skills/product-acceptance-criteria/SKILL.md`	SRE checklist skill definition for AI agents

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-04T10:39:36Z

util/rpc-client/src/client.ts

@@ -402,19 +402,20 @@ export class RpcClient {
            }
            req.resolve(result)
        }, err => {


The durationMs computation is placed inside the error callback, so it is only evaluated when a request fails. However, the startTime variable (presumably captured at the point the HTTP request is dispatched) is not visible in the diff. If startTime is defined outside the error callback and is shared across retries (i.e. reset on each dispatch attempt), the duration computed here will reflect the time since the last attempt was dispatched, not the total accumulated duration across all retry attempts. For an on-call engineer, the most useful figure is the total time spent on the operation. If startTime is not reset per attempt, durationMs is correct for that attempt but misleading if read as a total. This should be clearly documented with a comment indicating whether durationMs is per-attempt or cumulative, or it should be made unambiguously per-attempt by capturing startTime inside the dispatch closure rather than shared state.

Suggested change

}, err => {

}, err => {

// Duration of this individual send attempt only (not cumulative across retries).

Copilot · 2026-03-04T10:39:37Z

util/rpc-client/src/client.ts

+        let call = Array.isArray(req.call) ? req.call[0] : req.call
+        return addErrorContext(err, {
+            rpcUrl: this.url,
+            rpcMethod: call.method,


When req.call is a batch (array), only the first call's method is attached as rpcMethod. For a batch request, a single element's method is not representative of what failed — the entire batch may have failed, or only some calls within it. The enriched error will mislead triage by suggesting only one method was involved. Consider either joining all method names (e.g., req.call.map(c => c.method).join(',')) or adding a separate rpcBatchSize field alongside rpcMethod to make the batch nature explicit.

Suggested change

let call = Array.isArray(req.call) ? req.call[0] : req.call

return addErrorContext(err, {

rpcUrl: this.url,

rpcMethod: call.method,

let rpcMethod: string | undefined

let rpcBatchSize: number | undefined

if (Array.isArray(req.call)) {

rpcBatchSize = req.call.length

rpcMethod = req.call

.map(c => c.method)

.filter(m => m != null)

.join(',')

} else {

rpcMethod = req.call.method

}

return addErrorContext(err, {

rpcUrl: this.url,

rpcMethod,

...(rpcBatchSize != null ? {rpcBatchSize} : {}),

- Add comment clarifying durationMs is per-attempt, not cumulative - For batch calls, join all method names and include rpcBatchSize Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 4, 2026 10:38

mo4islona changed the base branch from master to open-beta March 4, 2026 10:38

Copilot AI reviewed Mar 4, 2026

View reviewed changes

Address review: clarify durationMs scope and handle batch methods

4f7b7e4

- Add comment clarifying durationMs is per-attempt, not cumulative - For batch calls, join all method names and include rpcBatchSize Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SRE acceptance criteria review for evm-dump + RPC error context#445

SRE acceptance criteria review for evm-dump + RPC error context#445
mo4islona wants to merge 2 commits intoopen-betafrom
sre/evm-dump-acceptance-criteria

mo4islona commented Mar 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 4, 2026

Uh oh!

Copilot AI Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	}, err => {
	}, err => {
	// Duration of this individual send attempt only (not cumulative across retries).

-        let call = Array.isArray(req.call) ? req.call[0] : req.call
-        return addErrorContext(err, {
-            rpcUrl: this.url,
-            rpcMethod: call.method,
+        let rpcMethod: string | undefined
+        let rpcBatchSize: number | undefined
+        if (Array.isArray(req.call)) {
+            rpcBatchSize = req.call.length
+            rpcMethod = req.call
+                .map(c => c.method)
+                .filter(m => m != null)
+                .join(',')
+        } else {
+            rpcMethod = req.call.method
+        }
+        return addErrorContext(err, {
+            rpcUrl: this.url,
+            rpcMethod,
+            ...(rpcBatchSize != null ? {rpcBatchSize} : {}),

Conversation

mo4islona commented Mar 4, 2026

Summary

Key findings (from SRE_REPORT.md)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants