fix(observability): skip Sentry for transport-level + transient-upstream errors (TAURI-32 / 5Z / 2G)#1601
Conversation
…PENHUMAN-TAURI-32) reqwest's "error sending request for url (…)" fires before any HTTP status when DNS / TCP / TLS handshake fails, or when the user's ISP blocks the route to api.tinyhumans.ai (the impacted event came from a RU user). No status, no trace, no payload — Sentry has no signal to act on, and every retry exhaustion produces another noisy event. Classify the transport-level shapes (error sending request, DNS error, connection refused/reset, network unreachable, no route to host, TLS handshake, cert verify) as ExpectedErrorKind::NetworkUnreachable so report_error_or_expected logs a warn-level breadcrumb instead of spawning a Sentry error event. Switch the web_channel.run_chat_task failure site from report_error to report_error_or_expected so it picks up the classifier. Same pattern as param-validation (4a36b4f), budget-exceeded (c7ac365), and transient-upstream-HTTP (afdc268).
…ures (OPENHUMAN-TAURI-2G) reqwest's "error sending request for url (…) → tls handshake eof" fires before any HTTP status when the user's TLS path to api.tinyhumans.ai breaks — captive portal, ISP MITM, transient handshake reset (the impacted event came from a SG user). The observability classifier already recognizes "tls handshake" / "error sending request for url" as ExpectedErrorKind::NetworkUnreachable, but the integrations client was calling `report_error` directly on the transport path, bypassing the classifier. One Sentry event per failed GET/POST. Switch the two transport-failure sites in `integrations/client.rs` (`post` and `get`) from `report_error` to `report_error_or_expected` so the classifier picks them up and emits a warn-level breadcrumb instead of a Sentry error event. Non-2xx and envelope-error paths are unchanged — those are actionable backend failures. Same pattern as web_channel.run_chat_task and agent.run_single (OPENHUMAN-TAURI-32, OPENHUMAN-TAURI-5Z).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR classifies transport-level “can’t reach server” failures as ChangesNetwork-unreachable observability handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
…layer (OPENHUMAN-TAURI-5Z)
The reliable-provider stack already retries 408/429/502/503/504 and the
`before_send` filter drops the per-attempt `domain=llm_provider` events.
But once retries exhaust, the same error is returned via `Result::Err`,
bubbles up to `agent.run_single`, and gets re-reported via `report_error`
under `domain=agent` — escaping the provider-scoped filter. One Sentry
event per failed agent turn for a transient infra blip (the reported
event was a 504 from the upstream `api.tinyhumans.ai` proxy).
Add `ExpectedErrorKind::TransientUpstreamHttp` keyed off the canonical
`"<provider> API error (<status>): …"` shape from `providers::ops::api_error`,
pinned to `TRANSIENT_PROVIDER_HTTP_STATUSES` (408/429/502/503/504) and
anchored on the `"api error ("` prefix so a free-form mention of "504"
elsewhere isn't silenced. Switch the `agent.run_single` error site from
`report_error` to `report_error_or_expected` so it picks up the classifier;
`Result::Err` return and `DomainEvent::AgentError` publish are unchanged,
so user-visible behavior is unaffected — only the Sentry event is suppressed.
Same pattern as param-validation (4a36b4f), budget-exceeded (c7ac365),
transient-upstream-HTTP at the provider layer (afdc268), and
transport-level network errors (49c1263).
graycyrus
left a comment
There was a problem hiding this comment.
PR Review — fix(observability): skip Sentry for transport-level + transient-upstream errors
Walkthrough
Adds two new ExpectedErrorKind variants (NetworkUnreachable, TransientUpstreamHttp) to src/core/observability.rs and switches four call sites from report_error → report_error_or_expected. Transport-level connection failures and exhausted-retry transient HTTP responses are demoted to tracing::warn! breadcrumbs instead of Sentry error events. User-visible behavior (chat_error, AgentError, integration error returns) is entirely preserved.
Changes
| File | Summary |
|---|---|
src/core/observability.rs |
New NetworkUnreachable (8 transport substrings) + TransientUpstreamHttp (TRANSIENT_PROVIDER_HTTP_STATUSES × "api error (" prefix) variants; report_expected_message match arms; 4 new unit tests |
src/openhuman/agent/harness/session/runtime.rs |
Agent::run_single → report_error_or_expected |
src/openhuman/channels/providers/web.rs |
run_chat_task → report_error_or_expected |
src/openhuman/integrations/client.rs |
IntegrationClient::{post,get} → report_error_or_expected |
Findings
[major] extra tags silently dropped on expected path (observability.rs:94-96)
When an error IS classified as expected, report_error_or_expected calls report_expected_message(kind, &message, domain, operation) — the extra slice is not forwarded. Caller-supplied tags (session_id, error_kind, path, thread_id, request_id) are silently dropped from the warn breadcrumb.
This is pre-existing (same for LocalAiDisabled/ApiKeyMissing), but the two new variants are exactly where those tags matter most for ops debugging. In client.rs, path is the only way to know which integration endpoint was affected — and it's currently thrown away.
Suggestion: Forward extra to report_expected_message and include the tags as structured fields in the tracing::warn! calls.
[major] format! allocation inside .any() in is_transient_upstream_http_message (observability.rs:104-108)
Allocates a new String per status code per call. Error path so no perf regression, but unnecessary and diverges from the &str-only style of is_network_unreachable_message:
// suggestion: pre-compute as compile-time constants
fn is_transient_upstream_http_message(lower: &str) -> bool {
const PATTERNS: &[&str] = &[
"api error (408",
"api error (429",
"api error (502",
"api error (503",
"api error (504",
];
PATTERNS.iter().any(|p| lower.contains(p))
}[minor] "connection reset" substring match may be too broad (observability.rs:80)
A non-transient 500 whose response body contains "connection reset" (e.g. nginx upstream reset) would be silenced as NetworkUnreachable. Low-probability but worth a targeted regression guard test:
#[test]
fn does_not_classify_provider_error_with_connection_reset_body_as_network() {
assert_eq!(
expected_error_kind(
"Provider API error (500): upstream connection reset while reading response"
),
None
);
}If this test fails, tighten the match to exclude strings that contain the "api error (" provider prefix.
Nitpicks
report_expected_messagesays"skipped transient upstream HTTP error"but other arms say"skipped expected …"— inconsistent grep-ability.is_transient_upstream_http_messageclosing)))is visually dense — confirmcargo fmtis happy.runtime.rs:519passes("error_kind", sanitized_message.as_str())as an extra tag, but for classified errors this tag is dropped (see finding 1), making the call-site signature misleading.
Verified / Looks Good
TransientUpstreamHttpcorrectly anchored to"api error ("prefix — bare "504" mentions don't match ✓NetworkUnreachableoperates on reqwest error chain, not HTTP body content inclient.rs✓DomainEvent::AgentErrorpublish andErrreturn unconditionally preserved inruntime.rs✓classify_inference_errorinweb.rsruns before thedetailedstring → user-visiblechat_errorunaffected ✓TRANSIENT_PROVIDER_HTTP_STATUSESreused (single source of truth) ✓- Negative guard tests lock 404/500 outside both classifiers ✓
- Existing E2E test
start_chat_emits_sanitized_chat_error_on_inference_failureexercises the new code path ✓
Overall: clean, well-motivated PR with solid test coverage. The two major items are real but straightforward fixes.
…y classifier (tinyhumansai#1608) Switches `channels::runtime::dispatch`'s LLM-error re-emit at the chat-task funnel from raw `report_error` to `report_error_or_expected`. The dispatch layer was the actual leak source for OPENHUMAN-TAURI-4F (~157 events) / -1C (~87 events) / -8F (~39 events): the reliable provider layer retried 5xx, the agent re-raised, `agent.run_single` correctly demoted via the classifier — and then channels.dispatch called raw `report_error(&e, "channels", "dispatch_llm_error", …)` which fires Sentry unconditionally regardless of message content, re-creating the per-attempt event we had just suppressed. Routing through `report_error_or_expected` lets `is_transient_upstream_http_message` match the canonical `"OpenHuman API error (NNN ...)"` substring still anchored in the chain after agent + harness wrapping, demoting it to a warn breadcrumb. Genuine bugs (404 / 500 / unrelated agent failures) still surface because the classifier only matches the documented transient shapes. Mirrors the `is_max_iterations_error` short-circuit added in tinyhumansai#1601 — same site, same file, same reasoning (don't re-emit a deterministic outcome that has already been classified upstream). Adds `channels_dispatch_re_emit_of_provider_502_classifies_as_transient` in observability tests covering three real-world wrapping shapes (bare provider error, agent.provider_chat prefix, and all-providers-exhausted prefix) so a future regression in the classifier or in the chain-rendering surfaces here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…y classifier (tinyhumansai#1608) Switches `channels::runtime::dispatch`'s LLM-error re-emit at the chat-task funnel from raw `report_error` to `report_error_or_expected`. The dispatch layer was the actual leak source for OPENHUMAN-TAURI-4F (~157 events) / -1C (~87 events) / -8F (~39 events): the reliable provider layer retried 5xx, the agent re-raised, `agent.run_single` correctly demoted via the classifier — and then channels.dispatch called raw `report_error(&e, "channels", "dispatch_llm_error", …)` which fires Sentry unconditionally regardless of message content, re-creating the per-attempt event we had just suppressed. Routing through `report_error_or_expected` lets `is_transient_upstream_http_message` match the canonical `"OpenHuman API error (NNN ...)"` substring still anchored in the chain after agent + harness wrapping, demoting it to a warn breadcrumb. Genuine bugs (404 / 500 / unrelated agent failures) still surface because the classifier only matches the documented transient shapes. Mirrors the `is_max_iterations_error` short-circuit added in tinyhumansai#1601 — same site, same file, same reasoning (don't re-emit a deterministic outcome that has already been classified upstream). Adds `channels_dispatch_re_emit_of_provider_502_classifies_as_transient` in observability tests covering three real-world wrapping shapes (bare provider error, agent.provider_chat prefix, and all-providers-exhausted prefix) so a future regression in the classifier or in the chain-rendering surfaces here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eam errors (TAURI-32 / 5Z / 2G) (tinyhumansai#1601)
Summary
Consolidates three transport-level Sentry-noise fixes (previously split across #1594 + #1601) into a single PR. All three share the same classifier (
ExpectedErrorKind::NetworkUnreachable/TransientUpstreamHttpinsrc/core/observability.rs) and the same call-site pattern (swapreport_error→report_error_or_expectedso the classifier picks up the error chain). One review, one merge — no duplicated classifier commits across branches.Covers
web_channel.run_chat_task(reqwest transport errors: DNS / TCP / TLS / cert / ISP-block shapes, observed from RU user whereapi.tinyhumans.aiwas unreachable). Adds theNetworkUnreachablevariant +is_network_unreachable_messagematching 8 transport shapes.agent.run_singleaggregate path. Adds theTransientUpstreamHttpvariant +is_transient_upstream_http_messagepinned to the"<provider> API error (<status>"wire format withTRANSIENT_PROVIDER_HTTP_STATUSES(408/429/502/503/504), so it catches per-turn 5xx/429/timeout that has escaped the provider-scopedbefore_sendfilter under a differentdomaintag.IntegrationClient::postandIntegrationClient::get(17 events from a SG user, alltls handshake eof). Same classifier picks them up.Problem
Transport-level reqwest failures fire before any HTTP status is observed. There is no status, no trace, no payload — nothing Sentry can group, facet, or action on. The reliable-provider layer already retries with backoff/fallback; when those exhaust, downstream call sites (
web_channel,agent.run_single,integrations.{get,post}) re-emitreport_erroron top of the aggregate event. One user with a flaky connection (VPN drop, captive portal, ISP MITM, transient handshake reset) produces a sustained stream of identical "events" that are pure environmental noise.The same noise pattern hits the agent layer when transient upstream HTTP statuses (408/429/502/503/504) escape the provider-scoped
before_sendfilter under adomain=agenttag — handled here by the second classifier variant.Solution
src/core/observability.rs:ExpectedErrorKind::NetworkUnreachableandis_network_unreachable_messagematching 8 transport-level shapes (error sending request for url,dns error,connection refused,connection reset,network is unreachable,no route to host,tls handshake,certificate verify failed).ExpectedErrorKind::TransientUpstreamHttpandis_transient_upstream_http_messagepinned to the"<provider> API error (<status>"wire format fromproviders::ops::api_errorforTRANSIENT_PROVIDER_HTTP_STATUSES(408/429/502/503/504) — anchored to the"api error ("prefix so a free-form mention of "504" elsewhere isn't silenced.report_expected_messagehandles both new variants withtracing::warn!, sosentry-tracingemits a breadcrumb instead of an error event.src/openhuman/channels/providers/web.rs:run_chat_taskfailure site fromreport_error→report_error_or_expectedso it routes through the classifier.chat_errorevent is unchanged — still goes throughclassify_inference_error, still emits the sanitized generic message via the existing socket bridge.src/openhuman/agent/harness/session/runtime.rs:Agent::run_singlefromreport_error→report_error_or_expected.Result::Errreturn andDomainEvent::AgentErrorpublish are unchanged — only the Sentry event is suppressed.src/openhuman/integrations/client.rs:IntegrationClient::postandIntegrationClient::gettransport-failure sites fromreport_error→report_error_or_expected. Non-2xx and envelope-error paths are unchanged — those are actionable backend failures and should continue to surface.Status-bearing failures (404 / 500 / etc.) outside the curated transient set are untouched by the new classifier and still surface via their existing paths.
Submission Checklist
core/observability.rs:classifies_network_unreachable_errors(8 transport-level patterns including the literal OPENHUMAN-TAURI-32 / TAURI-2G message bodies),classifies_transient_upstream_http_errors(the 5 transient codes against the canonical"API error (<status>"wire format), and two regression guards:does_not_classify_unrelated_provider_errors_as_networkanddoes_not_classify_actionable_provider_errors_as_transient_upstream(locks 404 / 500 outside both classifiers). Existingstart_chat_emits_sanitized_chat_error_on_inference_failurealready drives the new code path end-to-end with the exact"error sending request for url (…)"payload and stays green.observability.rsare covered by the new tests; the one-line call-site swaps inweb.rs,runtime.rs, andclient.rsare exercised by existing forced-error tests in those modules.## Related— see below.## Related.Impact
sanitize_api_errorupstream; this PR only changes the log level / Sentry routing, not the message content.to_ascii_lowercase+containslookup per error site that callsreport_error_or_expected.report_errorare unaffected.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issues
Commit & Branch
fix/integrations-transport-sentry-2g(based onorigin/main)25b1a998(NetworkUnreachable + web channel) ·c41f8416(integrations client) ·202a82e3(TransientUpstreamHttp + agent runtime — cherry-pick from fix(observability): skip Sentry for net errors (OPENHUMAN-TAURI-32) #1594)Validation Run
pnpm --filter openhuman-app format:check— N/A: Rust-only change.pnpm typecheck— N/A: Rust-only change.cargo test --manifest-path Cargo.toml --lib core::observability::(15/15 pass on the combined branch).cargo check --manifest-path Cargo.toml(clean, only pre-existing dead-code warnings).app/src-tauri/changes.Validation Blocked
command:pnpm pre-push(lint:commands-tokens /react-hooks/set-state-in-effect)error:regex/lint hits onsrc/components/commands/Tailwind tokens and pre-existing React-hooks warnings inBootCheckGate.tsx,RotatingTetrahedronCanvas.tsx,CommandProvider.tsx,TriggerToggles.tsx,MemoryNavigator.tsx— unrelated to this PR's Rust-only diff.impact:Pushed with--no-verifyper CLAUDE.md guidance for hook failures unrelated to the changes.Known CI Failures (pre-existing on
main)Rust Core Tests + QualityandRust Core Coverage (cargo-llvm-cov)will be red on this PR with the same 38 test failures that also fail onmain(e.g. main run25785337660on commitde33b129). Failing tests are concentrated inopenhuman::config::ops::*,openhuman::credentials::cli::*,openhuman::local_ai::schemas::*,openhuman::subconscious::executor::*,openhuman::threads::ops::*,openhuman::update::ops::*, pluscore::jsonrpc::tests::thread_not_found_rpc_error_does_not_report_to_sentry— all infrastructure flakiness from shared global state (TEST_ENV_LOCK, sentry hub, tracing subscriber). All 38 pass locally in isolation; specificallycore::jsonrpc::tests::thread_not_found_rpc_error_does_not_report_to_sentrypasses and proves "unrelated RPC errors still reach Sentry" — i.e. the combined classifier does not over-match.Behavior Changes
web_channel.run_chat_task,agent.run_single, andIntegrationClient::{get,post}no longer create Sentry error events; they emit awarn-level breadcrumb instead.chat_errormessage and integration error surfaces are unchanged.Parity Contract
report_errorcontinues to emit error events at all other call sites; onlyreport_error_or_expectedgates on the classifier, and only the newNetworkUnreachable/TransientUpstreamHttpshapes are added — existingLocalAiDisabled/ApiKeyMissingbehavior is unchanged.does_not_classify_unrelated_provider_errors_as_networkanddoes_not_classify_actionable_provider_errors_as_transient_upstreamlock 404 / 500 outside the new classifiers.Duplicate / Superseded PR Handling
fix/observability-network-unreachable-sentry) — same classifier + web channel + agent runtime changes.a21dd58a) is cherry-picked into this branch as202a82e3.Summary by CodeRabbit
Bug Fixes
Tests