
fix: harden internal runtime and transport paths — reduce fail-open behavior and churn#95

Merged
Lythaeon merged 122 commits into main from perf/runtime-churn-reduction
Apr 10, 2026

Conversation

@Lythaeon
Owner

Description

This PR hardens SOF's internal runtime, provider-stream, extension-host, derived-state, and tx transport paths while preserving the current feature surface and public-facing behavior.

The branch combines the runtime-churn reduction work with follow-up correctness, robustness, and security hardening. The result is less internal churn on hot paths, earlier rejection of invalid configs, bounded network/file IO on more edges, and tighter startup/runtime invariants without adding new APIs or features.

Changes

Detailed list of what changed:

  • crates/sof-observer/src/framework/extension_host.rs
    • bounded TCP/WebSocket connector startup hangs
    • clamped zero startup/shutdown timeouts
    • rejected zero and oversized resource read buffers
    • rejected invalid websocket resource URLs at manifest validation time
    • rejected dead/shared/observer-ingress subscription combinations that the runtime would otherwise accept and later drop
    • normalized import hygiene in hardened paths
  • crates/sof-observer/src/framework/derived_state.rs
    • bounded checkpoint load size
    • rejected oversized checkpoint writes before touching disk
    • rejected oversized replay records on encode/write so persisted segments cannot poison recovery later
  • crates/sof-observer/src/provider_stream/websocket.rs
    • bounded replay HTTP request stalls and response bodies
    • rejected replay HTTP redirects and hardened status handling
    • acknowledged pings before subscription ack handling
    • bounded websocket connect handshakes and subscription ack waits
    • clamped zero stall timeouts and normalized reconnect behavior to prevent reconnect spin loops
    • preserved/trimmed hot-path churn in notification handling and source updates
  • crates/sof-observer/src/provider_stream/yellowstone.rs
    • clamped zero gRPC connect/stall timeouts
    • bounded oversized account payload handling
    • trimmed conversion/setup churn in provider update paths
  • crates/sof-observer/src/provider_stream/laserstream.rs
    • shared client config helpers and transport defaults
    • clamped zero connect/request/stall timeouts
    • bounded oversized account payload handling
    • trimmed conversion/signature/setup churn in provider update paths
  • crates/sof-tx/src/providers.rs
    • hardened recent-blockhash RPC reads against redirects and oversized bodies
    • clamped zero request timeout configs
  • crates/sof-tx/src/submit/rpc.rs
    • bounded JSON-RPC submit response sizes and rejected redirect responses
  • crates/sof-tx/src/submit/jito.rs
    • bounded Jito HTTP submit response sizes and rejected redirects
    • clamped zero transport request timeouts
  • crates/sof-tx/src/submit/jito_grpc.rs
    • clamped zero gRPC request timeouts
  • crates/sof-tx/src/submit/types.rs
    • clamped zero direct-submit per-target/global/probe timeouts
  • crates/sof-tx/src/adapters/{plugin_host,derived_state}.rs
    • normalized import hygiene in hardened paths
    • fixed test-only import scoping so full workspace CI stays green
  • crates/sof-observer/src/app/runtime/*, ingest/*, repair/*, relay/*, verify/*, shred/*, crates/sof-support/src/lib.rs
    • continued internal churn reduction, helper reuse, retry/hard-bound fixes, and cleanup in the owning slices without changing exposed behavior
  • vendor/helius-laserstream/*
    • patched vendored dependency integration used by the provider-grpc feature set
  • docs and readmes
    • refreshed affected docs/readmes touched by the branch where behavior/setup expectations changed
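
Several entries above describe clamping zero timeouts. As a minimal sketch of that idea, assuming a hypothetical `MIN_TIMEOUT` floor and a hypothetical helper name `clamp_zero_timeout` (neither is taken from the SOF codebase, and the actual floor values on the branch may differ):

```rust
use std::time::Duration;

/// Hypothetical floor applied when a caller configures a zero timeout.
/// The real floor used by SOF is not stated in this PR.
const MIN_TIMEOUT: Duration = Duration::from_millis(1);

/// Returns `configured` unchanged unless it is zero, in which case the
/// floor is returned instead. A zero timeout would expire immediately,
/// which can turn a retry or reconnect loop into a busy spin.
pub fn clamp_zero_timeout(configured: Duration) -> Duration {
    if configured.is_zero() {
        MIN_TIMEOUT
    } else {
        configured
    }
}
```

Applying a helper like this once, where the config is built, would mean downstream connect, request, and stall paths never see a zero duration.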

For slice-related changes, include:

  • Affected slices
    • framework
    • provider_stream
    • ingest
    • app/runtime
    • repair
    • relay
    • verify
    • shred
    • sof-tx
    • sof-support
  • Cross-slice communication changes (if any) and why
    • Shared time/parsing/helper logic was consolidated into support/internal helpers to remove repeated edge handling and reduce maintenance churn.
    • Startup validation for extensions now rejects invalid manifests earlier instead of allowing silent drops at runtime or late network failures.
    • Provider/tx transport hardening keeps the same external contracts but makes timeout/body-limit behavior explicit and non-zero.
  • Migration requirements (if any)
    • None. This PR is intended to be internal-only hardening/cleanup with no new public API surface.
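
The earlier manifest rejection described above could look roughly like the following. This is an illustrative sketch only: `ResourceSpec`, `ManifestError`, `validate_resource`, and the `MAX_READ_BUFFER_BYTES` limit are hypothetical names and values, and the URL check is a plain scheme test rather than a full URL parse.

```rust
/// Hypothetical upper bound on a resource read buffer, in bytes.
const MAX_READ_BUFFER_BYTES: usize = 1 << 20;

/// Hypothetical, simplified stand-in for one extension resource entry.
pub struct ResourceSpec {
    pub url: String,
    pub read_buffer_bytes: usize,
}

/// Reasons a resource entry is rejected at validation time.
#[derive(Debug, PartialEq, Eq)]
pub enum ManifestError {
    ZeroReadBuffer,
    OversizedReadBuffer { requested: usize, max: usize },
    InvalidWebSocketUrl(String),
}

/// Rejects bad entries before any connection is attempted, so the
/// failure surfaces at startup rather than as a late network error.
pub fn validate_resource(spec: &ResourceSpec) -> Result<(), ManifestError> {
    if spec.read_buffer_bytes == 0 {
        return Err(ManifestError::ZeroReadBuffer);
    }
    if spec.read_buffer_bytes > MAX_READ_BUFFER_BYTES {
        return Err(ManifestError::OversizedReadBuffer {
            requested: spec.read_buffer_bytes,
            max: MAX_READ_BUFFER_BYTES,
        });
    }
    // Scheme-only check: accept ws:// or wss:// followed by a non-empty rest.
    let rest = spec
        .url
        .strip_prefix("ws://")
        .or_else(|| spec.url.strip_prefix("wss://"));
    match rest {
        Some(host) if !host.is_empty() => Ok(()),
        _ => Err(ManifestError::InvalidWebSocketUrl(spec.url.clone())),
    }
}
```

Returning a typed error per failure mode keeps the rejection reason visible to whoever wrote the manifest, instead of a subscription that is accepted and then silently dropped.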

Motivation

Business motivation:

  • Keep SOF operationally safe and predictable under malformed config, slow peers, stalled providers, and oversized payloads without expanding scope or changing the product surface.
  • Preserve the value of SOF as a pre-hardened runtime/tooling layer so downstream users do not have to rediscover these edge cases themselves.

Technical motivation:

  • Remove fail-open and late-failure behavior in extension/provider startup paths.
  • Bound more IO edges so stalled or oversized inputs fail deterministically.
  • Normalize non-zero timeout behavior across websocket, gRPC, and tx transports.
  • Continue internal churn reduction and helper reuse in hot/internal paths while keeping behavior stable.
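
To illustrate what bounding an IO edge can mean in practice, here is a minimal standard-library sketch of a size-limited read. The function name `read_bounded` and its error choice are assumptions for illustration; the branch itself may enforce limits differently for HTTP bodies, checkpoints, and replay records.

```rust
use std::io::{self, Read};

/// Reads at most `max_bytes` from `reader`, failing deterministically
/// when the input is larger instead of buffering it without limit.
pub fn read_bounded<R: Read>(reader: R, max_bytes: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so that "exactly at the limit" can be
    // told apart from "over the limit".
    let limit = (max_bytes as u64).saturating_add(1);
    reader.take(limit).read_to_end(&mut buf)?;
    if buf.len() > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "input exceeds configured size limit",
        ));
    }
    Ok(buf)
}
```

Memory use is capped at `max_bytes + 1` regardless of how much the peer or file offers, which is the property that makes an oversized input fail predictably.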

Alternative approaches considered:

  • Adding opt-in hardening flags: rejected because the safer behavior should be the default behavior.
  • Masking dependency audit warnings with local ignore policy: rejected because that does not reduce actual risk.
  • Large dependency upgrades to clear cargo audit: deferred because the remaining real advisories are rooted in the vendored Solana/Agave chain and require an upstream version move rather than a localized branch fix.

Scope and impact

  • Affected slices:
    • framework, provider_stream, ingest, app/runtime, repair, relay, verify, shred, sof-tx, sof-support
  • Data/API changes:
    • No intended public API contract changes
    • No feature additions/removals
  • Backward compatibility:
    • Preserved at the public API level
    • Invalid extension/provider configs may now fail earlier at startup instead of failing later at runtime
  • Performance impact:
    • This branch started from runtime-churn reduction work and keeps those wins in place
    • The follow-up hardening in this PR prioritizes correctness/robustness with no intentional public-behavior regressions
  • Security impact:
    • Better bounded IO and timeout handling on websocket, gRPC, tx submit, observability, and derived-state paths
    • Earlier rejection of invalid manifests/configs that previously failed open or failed late
    • cargo audit still reports 2 real vulnerabilities plus advisory warnings, but the remaining real findings are upstream in the Solana/Agave dependency chain (sof-solana-gossip -> solana-runtime -> agave-precompiles) and are not locally fixable on this branch without a larger dependency move
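
The bounded timeout handling listed under security impact can be pictured with a synchronous analogue. The sketch below uses a standard-library channel to stand in for a subscription acknowledgement stream; `wait_for_ack` and `AckError` are hypothetical names, and the actual provider code is presumably asynchronous and structured differently.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Outcomes of a bounded wait that did not produce an acknowledgement.
#[derive(Debug, PartialEq, Eq)]
pub enum AckError {
    TimedOut,
    Disconnected,
}

/// Waits for one acknowledgement for at most `max_wait`. A peer that
/// never answers yields `TimedOut` instead of blocking the caller forever.
pub fn wait_for_ack<T>(acks: &Receiver<T>, max_wait: Duration) -> Result<T, AckError> {
    match acks.recv_timeout(max_wait) {
        Ok(ack) => Ok(ack),
        Err(RecvTimeoutError::Timeout) => Err(AckError::TimedOut),
        Err(RecvTimeoutError::Disconnected) => Err(AckError::Disconnected),
    }
}
```

Distinguishing a timeout from a closed channel lets the caller decide whether to retry the same connection or reconnect from scratch.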

Testing

  • Unit tests
  • Integration tests
  • Manual verification
  • Performance checks (if applicable)
  • Security checks (if applicable)

Commands/results:

cargo make ci
cargo audit --quiet

Results:

  • cargo make ci passed locally
  • cargo audit --quiet still reports upstream dependency-chain advisories rooted in Solana/Agave crates; no local suppression policy was added in this branch

Additional focused validation run during the branch:

  • targeted regression tests for websocket handshake/ack/stall bounds
  • targeted regression tests for extension-host manifest validation and timeout clamping
  • targeted regression tests for derived-state replay/checkpoint bounds
  • targeted regression tests for tx transport timeout/body-limit handling
  • focused cargo clippy -p sof --lib --tests -- -D warnings
  • focused cargo clippy -p sof-tx --lib --tests -- -D warnings

Related issues and documentation

  • Fixes:
    • N/A
  • Related:
    • branch-local hardening and runtime-churn cleanup work
  • Architecture docs: docs/architecture/README.md
  • Relevant ARD/ADR:
    • docs/architecture/ard/0002-testing-strategy-and-quality-gates.md
    • docs/architecture/ard/0003-slice-dependency-contracts.md
    • docs/architecture/ard/0004-error-taxonomy-and-failure-handling.md
    • docs/architecture/ard/0005-type-system-and-newtype-guidelines.md
    • docs/architecture/ard/0007-infrastructure-composition-and-runtime-model.md
    • docs/architecture/ard/0008-observability-and-operability-standards.md
  • Operations/runbook updates:
    • none required beyond the doc/readme touch-ups already included on the branch

Reviewer checklist

  • Code follows project standards and architecture constraints
  • Slice boundaries are respected (docs/architecture/ard/0003-slice-dependency-contracts.md)
  • Tests added/updated and passing
  • Documentation updated (README/docs/operations as needed)
  • No undocumented breaking change
  • Performance trade-offs documented where relevant
  • Security considerations addressed where relevant

Additional notes

  • This branch is intentionally broad in internal coverage but narrow in outward scope: it improves robustness, correctness, and security posture, and reduces internal churn, without introducing new user-facing features.
  • The remaining cargo audit findings should be tracked as upstream dependency risk. Clearing them cleanly will require coordinated Solana/Agave dependency movement rather than more local hardening in SOF.

Celestial added 30 commits April 9, 2026 21:39
Celestial added 28 commits April 10, 2026 21:01
@Lythaeon Lythaeon merged commit f75fd2d into main Apr 10, 2026
2 checks passed
@Lythaeon Lythaeon deleted the perf/runtime-churn-reduction branch April 10, 2026 22:22