
fix(observability): skip Sentry for transport-level + transient-upstream errors (TAURI-32 / 5Z / 2G) #1601

Merged
senamakel merged 4 commits into tinyhumansai:main from CodeGhost21:fix/integrations-transport-sentry-2g
May 13, 2026

Conversation

@CodeGhost21
Contributor

@CodeGhost21 CodeGhost21 commented May 13, 2026

Summary

Consolidates three transport-level Sentry-noise fixes (previously split across #1594 + #1601) into a single PR. All three share the same classifier (ExpectedErrorKind::NetworkUnreachable / TransientUpstreamHttp in src/core/observability.rs) and the same call-site pattern (swap report_error → report_error_or_expected so the classifier picks up the error chain). One review, one merge, and no duplicated classifier commits across branches.

Covers

  • OPENHUMAN-TAURI-32: web_channel.run_chat_task (reqwest transport errors: DNS / TCP / TLS / cert / ISP-block shapes, observed from a RU user for whom api.tinyhumans.ai was unreachable). Adds the NetworkUnreachable variant + is_network_unreachable_message matching 8 transport shapes.
  • OPENHUMAN-TAURI-5Z: agent.run_single aggregate path. Adds the TransientUpstreamHttp variant + is_transient_upstream_http_message pinned to the "<provider> API error (<status>" wire format with TRANSIENT_PROVIDER_HTTP_STATUSES (408/429/502/503/504), so it catches per-turn 5xx/429/timeout errors that have escaped the provider-scoped before_send filter under a different domain tag.
  • OPENHUMAN-TAURI-2G: IntegrationClient::post and IntegrationClient::get (17 events from a SG user, all tls handshake eof). The same classifier picks them up.

Problem

Transport-level reqwest failures fire before any HTTP status is observed. There is no status, no trace, no payload — nothing Sentry can group, facet, or action on. The reliable-provider layer already retries with backoff/fallback; when those exhaust, downstream call sites (web_channel, agent.run_single, integrations.{get,post}) re-emit report_error on top of the aggregate event. One user with a flaky connection (VPN drop, captive portal, ISP MITM, transient handshake reset) produces a sustained stream of identical "events" that are pure environmental noise.

The same noise pattern hits the agent layer when transient upstream HTTP statuses (408/429/502/503/504) escape the provider-scoped before_send filter under a domain=agent tag — handled here by the second classifier variant.
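
To make the escape path concrete, here is a minimal sketch of a provider-scoped filter. The `Event` struct and `before_send_keeps` function are hypothetical stand-ins (the real filter operates on Sentry events and is not shown in this PR); only the status list and the two domain tags come from the description above.

```rust
/// Statuses treated as transient, as listed in the PR description.
const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] = &[408, 429, 502, 503, 504];

/// Hypothetical stand-in for a Sentry event: only the fields the filter reads.
struct Event<'a> {
    domain: &'a str,
    status: Option<u16>,
}

/// Returns true when the event would be kept (sent to Sentry).
/// The filter is provider-scoped: it only inspects `domain=llm_provider`.
fn before_send_keeps(event: &Event<'_>) -> bool {
    if event.domain != "llm_provider" {
        // Any other domain tag bypasses the filter entirely.
        return true;
    }
    match event.status {
        Some(status) => !TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&status),
        None => true,
    }
}
```

Under these assumptions a 504 tagged `domain=llm_provider` is dropped, while the same 504 re-reported under `domain=agent` is kept, which is the leak the second classifier variant closes.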

Solution

src/core/observability.rs:

  • Add ExpectedErrorKind::NetworkUnreachable and is_network_unreachable_message matching 8 transport-level shapes (error sending request for url, dns error, connection refused, connection reset, network is unreachable, no route to host, tls handshake, certificate verify failed).
  • Add ExpectedErrorKind::TransientUpstreamHttp and is_transient_upstream_http_message pinned to the "<provider> API error (<status>" wire format from providers::ops::api_error for TRANSIENT_PROVIDER_HTTP_STATUSES (408/429/502/503/504) — anchored to the "api error (" prefix so a free-form mention of "504" elsewhere isn't silenced.
  • report_expected_message handles both new variants with tracing::warn!, so sentry-tracing emits a breadcrumb instead of an error event.
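
For reviewers who want the shape of the classifier without opening the diff, a rough sketch follows. The eight markers and five statuses are taken from the bullets above; the constant name `NETWORK_UNREACHABLE_MARKERS`, the order of the two checks, and the reduced two-variant enum are assumptions, not the code in src/core/observability.rs.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ExpectedErrorKind {
    NetworkUnreachable,
    TransientUpstreamHttp,
}

const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] = &[408, 429, 502, 503, 504];

// Assumed name; the eight shapes are the ones listed in the PR description.
const NETWORK_UNREACHABLE_MARKERS: &[&str] = &[
    "error sending request for url",
    "dns error",
    "connection refused",
    "connection reset",
    "network is unreachable",
    "no route to host",
    "tls handshake",
    "certificate verify failed",
];

/// Expects an already lower-cased message.
fn is_network_unreachable_message(lower: &str) -> bool {
    NETWORK_UNREACHABLE_MARKERS.iter().any(|m| lower.contains(m))
}

/// Anchored on the "api error (" prefix so a bare "504" elsewhere does not match.
fn is_transient_upstream_http_message(lower: &str) -> bool {
    TRANSIENT_PROVIDER_HTTP_STATUSES
        .iter()
        .any(|status| lower.contains(&format!("api error ({status}")))
}

/// Lower-cases once, then tries each classifier (order is an assumption).
fn expected_error_kind(message: &str) -> Option<ExpectedErrorKind> {
    let lower = message.to_ascii_lowercase();
    if is_transient_upstream_http_message(&lower) {
        Some(ExpectedErrorKind::TransientUpstreamHttp)
    } else if is_network_unreachable_message(&lower) {
        Some(ExpectedErrorKind::NetworkUnreachable)
    } else {
        None
    }
}
```

Status-bearing errors such as 404 or 500 fall through to `None` in this sketch unless their text happens to contain one of the transport markers.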

src/openhuman/channels/providers/web.rs:

  • Switch the run_chat_task failure site from report_error → report_error_or_expected so it routes through the classifier.
  • The user-facing chat_error event is unchanged — still goes through classify_inference_error, still emits the sanitized generic message via the existing socket bridge.

src/openhuman/agent/harness/session/runtime.rs:

  • Switch Agent::run_single from report_error → report_error_or_expected. The Result::Err return and the DomainEvent::AgentError publish are unchanged; only the Sentry event is suppressed.

src/openhuman/integrations/client.rs:

  • Switch IntegrationClient::post and IntegrationClient::get transport-failure sites from report_error → report_error_or_expected. Non-2xx and envelope-error paths are unchanged; those are actionable backend failures and should continue to surface.

Status-bearing failures (404 / 500 / etc.) outside the curated transient set are untouched by the new classifier and still surface via their existing paths.
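
The call-site swap can be pictured with the sketch below. It is deliberately simplified: the real `report_error` and `report_error_or_expected` take more arguments than a message string, and `Routed`, `looks_expected`, and `get_with_reporting` are hypothetical helpers standing in for the tracing/Sentry sinks and the integrations client.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Routed {
    WarnBreadcrumb, // stands in for a tracing::warn! breadcrumb
    ErrorEvent,     // stands in for a Sentry error event
}

/// Hypothetical, reduced classifier: two of the documented shapes only.
fn looks_expected(message: &str) -> bool {
    let lower = message.to_ascii_lowercase();
    lower.contains("tls handshake") || lower.contains("api error (504")
}

/// Simplified signature: always produces an error event.
fn report_error(_message: &str) -> Routed {
    Routed::ErrorEvent
}

/// Simplified signature: demotes classified messages, otherwise defers to report_error.
fn report_error_or_expected(message: &str) -> Routed {
    if looks_expected(message) {
        Routed::WarnBreadcrumb
    } else {
        report_error(message)
    }
}

/// Hypothetical call site: the error is reported, then returned unchanged.
fn get_with_reporting(
    transport_result: Result<String, String>,
    sink: &mut Vec<Routed>,
) -> Result<String, String> {
    transport_result.map_err(|e| {
        sink.push(report_error_or_expected(&e));
        e
    })
}
```

The point of the sketch is that only the reporting side changes: the caller still receives the same `Err`, so user-visible handling is untouched.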

Submission Checklist

  • Tests added or updated: four new unit tests in core/observability.rs. Two positive tests: classifies_network_unreachable_errors (8 transport-level patterns including the literal OPENHUMAN-TAURI-32 / TAURI-2G message bodies) and classifies_transient_upstream_http_errors (the 5 transient codes against the canonical "API error (<status>" wire format). Two regression guards: does_not_classify_unrelated_provider_errors_as_network and does_not_classify_actionable_provider_errors_as_transient_upstream (locks 404 / 500 outside both classifiers). The existing start_chat_emits_sanitized_chat_error_on_inference_failure already drives the new code path end-to-end with the exact "error sending request for url (…)" payload and stays green.
  • Diff coverage ≥ 80% — N/A locally; will run via CI. All new lines in observability.rs are covered by the new tests; the one-line call-site swaps in web.rs, runtime.rs, and client.rs are exercised by existing forced-error tests in those modules.
  • Coverage matrix updated — N/A: behaviour-only change (Sentry classification, no new user-visible feature row).
  • All affected feature IDs listed under ## Related — see below.
  • No new external network dependencies introduced.
  • Manual smoke checklist updated — N/A: change is on the error-reporting side, not a release-cut surface.
  • Linked issue — see ## Related.

Impact

  • Runtime: desktop only (web channel, agent harness, and integrations client all run in the core sidecar).
  • Security: no change. Error text was already redacted via sanitize_api_error upstream; this PR only changes the log level / Sentry routing, not the message content.
  • Performance: negligible — one extra to_ascii_lowercase + contains lookup per error site that calls report_error_or_expected.
  • Migration / compatibility: none. Existing callers of report_error are unaffected.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issues

Commit & Branch

Validation Run

  • pnpm --filter openhuman-app format:check — N/A: Rust-only change.
  • pnpm typecheck — N/A: Rust-only change.
  • Focused tests: cargo test --manifest-path Cargo.toml --lib core::observability:: (15/15 pass on the combined branch).
  • Rust fmt/check: cargo check --manifest-path Cargo.toml (clean, only pre-existing dead-code warnings).
  • Tauri fmt/check: N/A — no app/src-tauri/ changes.

Validation Blocked

  • command: pnpm pre-push (lint:commands-tokens / react-hooks/set-state-in-effect)
  • error: regex/lint hits on src/components/commands/ Tailwind tokens and pre-existing React-hooks warnings in BootCheckGate.tsx, RotatingTetrahedronCanvas.tsx, CommandProvider.tsx, TriggerToggles.tsx, MemoryNavigator.tsx — unrelated to this PR's Rust-only diff.
  • impact: Pushed with --no-verify per CLAUDE.md guidance for hook failures unrelated to the changes.

Known CI Failures (pre-existing on main)

  • Rust Core Tests + Quality and Rust Core Coverage (cargo-llvm-cov) will be red on this PR with the same 38 test failures that also fail on main (e.g. main run 25785337660 on commit de33b129). Failing tests are concentrated in openhuman::config::ops::*, openhuman::credentials::cli::*, openhuman::local_ai::schemas::*, openhuman::subconscious::executor::*, openhuman::threads::ops::*, openhuman::update::ops::*, plus core::jsonrpc::tests::thread_not_found_rpc_error_does_not_report_to_sentry — all infrastructure flakiness from shared global state (TEST_ENV_LOCK, sentry hub, tracing subscriber). All 38 pass locally in isolation; specifically core::jsonrpc::tests::thread_not_found_rpc_error_does_not_report_to_sentry passes and proves "unrelated RPC errors still reach Sentry" — i.e. the combined classifier does not over-match.

Behavior Changes

  • Intended behavior change: transport-level connection failures and transient upstream HTTP failures (408/429/502/503/504) from web_channel.run_chat_task, agent.run_single, and IntegrationClient::{get,post} no longer create Sentry error events; they emit a warn-level breadcrumb instead.
  • User-visible effect: none. The on-screen chat_error message and integration error surfaces are unchanged.

Parity Contract

  • Legacy behavior preserved: report_error continues to emit error events at all other call sites; only report_error_or_expected gates on the classifier, and only the new NetworkUnreachable / TransientUpstreamHttp shapes are added — existing LocalAiDisabled / ApiKeyMissing behavior is unchanged.
  • Guard/fallback/dispatch parity checks: does_not_classify_unrelated_provider_errors_as_network and does_not_classify_actionable_provider_errors_as_transient_upstream lock 404 / 500 outside the new classifiers.

Duplicate / Superseded PR Handling

Summary by CodeRabbit

  • Bug Fixes

    • Improved detection of transport-level network failures (DNS/TCP/TLS/connectivity) so they’re classified as expected and logged as warnings instead of triggering error events.
    • Updated chat, integration client, and agent runtime reporting to use the new expected-vs-unexpected handling, reducing noisy alerts.
  • Tests

    • Added unit tests covering multiple network-failure message variants and verifying provider errors remain correctly classified.

Review Change Stack

…PENHUMAN-TAURI-32)

reqwest's "error sending request for url (…)" fires before any HTTP
status when DNS / TCP / TLS handshake fails, or when the user's ISP
blocks the route to api.tinyhumans.ai (the impacted event came from a
RU user). No status, no trace, no payload — Sentry has no signal to
act on, and every retry exhaustion produces another noisy event.

Classify the transport-level shapes (error sending request, DNS error,
connection refused/reset, network unreachable, no route to host, TLS
handshake, cert verify) as ExpectedErrorKind::NetworkUnreachable so
report_error_or_expected logs a warn-level breadcrumb instead of
spawning a Sentry error event. Switch the web_channel.run_chat_task
failure site from report_error to report_error_or_expected so it
picks up the classifier.

Same pattern as param-validation (4a36b4f), budget-exceeded
(c7ac365), and transient-upstream-HTTP (afdc268).

…ures (OPENHUMAN-TAURI-2G)

reqwest's "error sending request for url (…) → tls handshake eof"
fires before any HTTP status when the user's TLS path to
api.tinyhumans.ai breaks — captive portal, ISP MITM, transient
handshake reset (the impacted event came from a SG user). The
observability classifier already recognizes "tls handshake" /
"error sending request for url" as
ExpectedErrorKind::NetworkUnreachable, but the integrations client
was calling `report_error` directly on the transport path, bypassing
the classifier. One Sentry event per failed GET/POST.

Switch the two transport-failure sites in `integrations/client.rs`
(`post` and `get`) from `report_error` to `report_error_or_expected`
so the classifier picks them up and emits a warn-level breadcrumb
instead of a Sentry error event. Non-2xx and envelope-error paths
are unchanged — those are actionable backend failures.

Same pattern as web_channel.run_chat_task and agent.run_single
(OPENHUMAN-TAURI-32, OPENHUMAN-TAURI-5Z).

@CodeGhost21 CodeGhost21 requested a review from a team May 13, 2026 07:56
@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31364b1b-56ce-429f-b6a7-ec94a4cb7e84

📥 Commits

Reviewing files that changed from the base of the PR and between 202a82e and a4919cb.

📒 Files selected for processing (1)
  • src/core/observability.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/core/observability.rs

📝 Walkthrough

This PR classifies transport-level “can’t reach server” failures as NetworkUnreachable and logs them as warn-level expected events. It updates three call sites (web chat, integrations client POST/GET, agent runtime) to use the expected-error reporting path and adds unit tests for classification behavior.

Changes

Network-unreachable observability handling

| Layer / File(s) | Summary |
|---|---|
| NetworkUnreachable variant and detection logic (src/core/observability.rs) | ExpectedErrorKind gains NetworkUnreachable. expected_error_kind adds detection for transport/DNS/TCP/TLS/certificate/connectivity failure markers via is_network_unreachable_message and transient-upstream checks. Unit tests validate positive matches and non-classification of status-bearing provider errors. |
| Expected message reporting for NetworkUnreachable (src/core/observability.rs) | report_expected_message handles NetworkUnreachable by emitting a standardized tracing::warn! breadcrumb (“skipped expected network-unreachable error”) instead of reporting an error event. |
| Web chat error reporting integration (src/openhuman/channels/providers/web.rs) | start_chat task error handler switches from report_error to report_error_or_expected, routing transport failures through the new classification and documenting the warn-level breadcrumb behavior. |
| Integration client error reporting updates (src/openhuman/integrations/client.rs) | IntegrationClient::post and IntegrationClient::get .map_err paths now call report_error_or_expected (instead of report_error), preserving enriched error chains but classifying common reqwest transport/connect/TLS failures as expected. |
| Agent runtime error reporting update (src/openhuman/agent/harness/session/runtime.rs) | Agent::run_single Err branch now calls report_error_or_expected so exhausted retries / transient upstream failures are demoted to expected warn-level breadcrumbs while preserving existing AgentError publish and error return. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • tinyhumansai/openhuman#1264: Modifies src/core/observability.rs and callsites to change error reporting foundations; related to this PR's classification extensions.
  • tinyhumansai/openhuman#1529: Suppresses Sentry reports for transient upstream HTTP failures; conceptually overlaps with transient-upstream handling here.
  • tinyhumansai/openhuman#1512: Adjusts expected classification for provider HTTP statuses (e.g., 429); related to transient-status classification.

Suggested reviewers

  • senamakel

Poem

🐰 I sniffed the wires and gave a twitch—
A DNS hiccup, a TLS glitch.
Not every fall needs a siren's song,
Now we log a whisper when networks go wrong.
Hopping on, the observability hop! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: introducing observability fixes to skip Sentry reporting for transport-level and transient upstream HTTP errors while preserving error handling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 13, 2026

…layer (OPENHUMAN-TAURI-5Z)

The reliable-provider stack already retries 408/429/502/503/504 and the
`before_send` filter drops the per-attempt `domain=llm_provider` events.
But once retries exhaust, the same error is returned via `Result::Err`,
bubbles up to `agent.run_single`, and gets re-reported via `report_error`
under `domain=agent` — escaping the provider-scoped filter. One Sentry
event per failed agent turn for a transient infra blip (the reported
event was a 504 from the upstream `api.tinyhumans.ai` proxy).

Add `ExpectedErrorKind::TransientUpstreamHttp` keyed off the canonical
`"<provider> API error (<status>): …"` shape from `providers::ops::api_error`,
pinned to `TRANSIENT_PROVIDER_HTTP_STATUSES` (408/429/502/503/504) and
anchored on the `"api error ("` prefix so a free-form mention of "504"
elsewhere isn't silenced. Switch the `agent.run_single` error site from
`report_error` to `report_error_or_expected` so it picks up the classifier;
`Result::Err` return and `DomainEvent::AgentError` publish are unchanged,
so user-visible behavior is unaffected — only the Sentry event is suppressed.

Same pattern as param-validation (4a36b4f), budget-exceeded (c7ac365),
transient-upstream-HTTP at the provider layer (afdc268), and
transport-level network errors (49c1263).

@CodeGhost21 CodeGhost21 changed the title from "fix(observability): skip Sentry for transport-level integrations failures (OPENHUMAN-TAURI-2G)" to "fix(observability): skip Sentry for transport-level + transient-upstream errors (TAURI-32 / 5Z / 2G)" May 13, 2026
Contributor

@graycyrus graycyrus left a comment

PR Review — fix(observability): skip Sentry for transport-level + transient-upstream errors

Walkthrough

Adds two new ExpectedErrorKind variants (NetworkUnreachable, TransientUpstreamHttp) to src/core/observability.rs and switches four call sites from report_error to report_error_or_expected. Transport-level connection failures and exhausted-retry transient HTTP responses are demoted to tracing::warn! breadcrumbs instead of Sentry error events. User-visible behavior (chat_error, AgentError, integration error returns) is entirely preserved.

Changes

| File | Summary |
|---|---|
| src/core/observability.rs | New NetworkUnreachable (8 transport substrings) + TransientUpstreamHttp (TRANSIENT_PROVIDER_HTTP_STATUSES × "api error (" prefix) variants; report_expected_message match arms; 4 new unit tests |
| src/openhuman/agent/harness/session/runtime.rs | Agent::run_single → report_error_or_expected |
| src/openhuman/channels/providers/web.rs | run_chat_task → report_error_or_expected |
| src/openhuman/integrations/client.rs | IntegrationClient::{post,get} → report_error_or_expected |

Findings

[major] extra tags silently dropped on expected path (observability.rs:94-96)

When an error IS classified as expected, report_error_or_expected calls report_expected_message(kind, &message, domain, operation) — the extra slice is not forwarded. Caller-supplied tags (session_id, error_kind, path, thread_id, request_id) are silently dropped from the warn breadcrumb.

This is pre-existing (same for LocalAiDisabled/ApiKeyMissing), but the two new variants are exactly where those tags matter most for ops debugging. In client.rs, path is the only way to know which integration endpoint was affected — and it's currently thrown away.

Suggestion: Forward extra to report_expected_message and include the tags as structured fields in the tracing::warn! calls.
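
A rough illustration of what forwarding could look like, using a plain string instead of `tracing::warn!` structured fields so the sketch stays dependency-free. `expected_breadcrumb` is a hypothetical helper, and the tag value in the usage note is made up for illustration.

```rust
/// Hypothetical helper: renders the breadcrumb text and appends each
/// caller-supplied tag as a key=value field instead of dropping it.
fn expected_breadcrumb(
    kind: &str,
    domain: &str,
    operation: &str,
    extra: &[(&str, &str)],
) -> String {
    let mut line =
        format!("skipped expected {kind} error domain={domain} operation={operation}");
    for (key, value) in extra {
        line.push_str(&format!(" {key}={value}"));
    }
    line
}
```

With `extra = [("path", "/v1/example")]` the integrations breadcrumb would carry the endpoint, which is the piece of context that finding is about.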

[major] format! allocation inside .any() in is_transient_upstream_http_message (observability.rs:104-108)

Allocates a new String per status code per call. Error path so no perf regression, but unnecessary and diverges from the &str-only style of is_network_unreachable_message:

// suggestion: pre-compute as compile-time constants
fn is_transient_upstream_http_message(lower: &str) -> bool {
    const PATTERNS: &[&str] = &[
        "api error (408",
        "api error (429",
        "api error (502",
        "api error (503",
        "api error (504",
    ];
    PATTERNS.iter().any(|p| lower.contains(p))
}
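
An alternative that avoids both the allocation and a second hard-coded list is to parse the status that follows the anchor and look it up in the existing constant. This is only a sketch, and note that it is slightly stricter than a prefix match: the whole digit run has to be one of the listed statuses.

```rust
const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] = &[408, 429, 502, 503, 504];

/// Expects an already lower-cased message. Allocation-free: scans for the
/// anchor, parses the digits that follow, and checks them against the constant.
fn is_transient_upstream_http_message(lower: &str) -> bool {
    const ANCHOR: &str = "api error (";
    lower.match_indices(ANCHOR).any(|(idx, _)| {
        let rest = &lower[idx + ANCHOR.len()..];
        // Length of the leading run of ASCII digits after the anchor.
        let digits_len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
        rest[..digits_len]
            .parse::<u16>()
            .map(|status| TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&status))
            .unwrap_or(false)
    })
}
```

Whether the stricter match is desirable depends on how `providers::ops::api_error` renders the status; if anything other than a non-digit ever follows the code, the constant-pattern version above is the safer choice.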

[minor] "connection reset" substring match may be too broad (observability.rs:80)

A non-transient 500 whose response body contains "connection reset" (e.g. nginx upstream reset) would be silenced as NetworkUnreachable. Low-probability but worth a targeted regression guard test:

#[test]
fn does_not_classify_provider_error_with_connection_reset_body_as_network() {
    assert_eq!(
        expected_error_kind(
            "Provider API error (500): upstream connection reset while reading response"
        ),
        None
    );
}

If this test fails, tighten the match to exclude strings that contain the "api error (" provider prefix.
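
One possible shape for that tightening, as a sketch only: bail out early when the message carries the provider prefix, since a status-bearing error means the request did reach a server. The marker list repeats the eight shapes from the PR description; the constant name is an assumption.

```rust
const NETWORK_UNREACHABLE_MARKERS: &[&str] = &[
    "error sending request for url",
    "dns error",
    "connection refused",
    "connection reset",
    "network is unreachable",
    "no route to host",
    "tls handshake",
    "certificate verify failed",
];

/// Expects an already lower-cased message.
fn is_network_unreachable_message(lower: &str) -> bool {
    // A provider-formatted error carries an HTTP status, so it is not a
    // transport-level failure even if its body mentions a reset.
    if lower.contains("api error (") {
        return false;
    }
    NETWORK_UNREACHABLE_MARKERS.iter().any(|m| lower.contains(m))
}
```

The trade-off is that an error chain containing both a provider-formatted wrapper and a genuine transport failure would no longer be classified as NetworkUnreachable, so it may be worth checking real chain renderings before adopting this.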

Nitpicks

  • report_expected_message says "skipped transient upstream HTTP error" but other arms say "skipped expected …" — inconsistent grep-ability.
  • is_transient_upstream_http_message closing ))) is visually dense — confirm cargo fmt is happy.
  • runtime.rs:519 passes ("error_kind", sanitized_message.as_str()) as an extra tag, but for classified errors this tag is dropped (see finding 1), making the call-site signature misleading.

Verified / Looks Good

  • TransientUpstreamHttp correctly anchored to "api error (" prefix — bare "504" mentions don't match ✓
  • NetworkUnreachable operates on reqwest error chain, not HTTP body content in client.rs ✓
  • DomainEvent::AgentError publish and Err return unconditionally preserved in runtime.rs ✓
  • classify_inference_error in web.rs runs before the detailed string → user-visible chat_error unaffected ✓
  • TRANSIENT_PROVIDER_HTTP_STATUSES reused (single source of truth) ✓
  • Negative guard tests lock 404/500 outside both classifiers ✓
  • Existing E2E test start_chat_emits_sanitized_chat_error_on_inference_failure exercises the new code path ✓

Overall: clean, well-motivated PR with solid test coverage. The two major items are real but straightforward fixes.

@senamakel senamakel merged commit a13f842 into tinyhumansai:main May 13, 2026
21 checks passed
oxoxDev added a commit to oxoxDev/openhuman that referenced this pull request May 15, 2026
…y classifier (tinyhumansai#1608)

Switches `channels::runtime::dispatch`'s LLM-error re-emit at the
chat-task funnel from raw `report_error` to `report_error_or_expected`.
The dispatch layer was the actual leak source for OPENHUMAN-TAURI-4F
(~157 events) / -1C (~87 events) / -8F (~39 events): the reliable
provider layer retried 5xx, the agent re-raised, `agent.run_single`
correctly demoted via the classifier — and then channels.dispatch
called raw `report_error(&e, "channels", "dispatch_llm_error", …)`
which fires Sentry unconditionally regardless of message content,
re-creating the per-attempt event we had just suppressed.

Routing through `report_error_or_expected` lets
`is_transient_upstream_http_message` match the canonical
`"OpenHuman API error (NNN ...)"` substring still anchored in the
chain after agent + harness wrapping, demoting it to a warn breadcrumb.
Genuine bugs (404 / 500 / unrelated agent failures) still surface
because the classifier only matches the documented transient shapes.

Mirrors the `is_max_iterations_error` short-circuit added in tinyhumansai#1601 —
same site, same file, same reasoning (don't re-emit a deterministic
outcome that has already been classified upstream).

Adds `channels_dispatch_re_emit_of_provider_502_classifies_as_transient`
in observability tests covering three real-world wrapping shapes
(bare provider error, agent.provider_chat prefix, and
all-providers-exhausted prefix) so a future regression in the
classifier or in the chain-rendering surfaces here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
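
A sketch of why the anchored substring survives wrapping. The three strings below are hypothetical renderings of the shapes named in the commit message (bare provider error, agent.provider_chat prefix, all-providers-exhausted prefix); the actual test fixtures are not shown here and may differ.

```rust
/// Expects an already lower-cased message; same anchored patterns as the
/// reviewer's constant-based suggestion.
fn is_transient_upstream_http_message(lower: &str) -> bool {
    const PATTERNS: &[&str] = &[
        "api error (408",
        "api error (429",
        "api error (502",
        "api error (503",
        "api error (504",
    ];
    PATTERNS.iter().any(|p| lower.contains(p))
}

/// Lower-cases the rendered chain before matching.
fn classifies_as_transient(message: &str) -> bool {
    is_transient_upstream_http_message(&message.to_ascii_lowercase())
}

// Hypothetical wrapping shapes; the prefixes are illustrative only.
const WRAPPED_502_SHAPES: [&str; 3] = [
    "OpenHuman API error (502 Bad Gateway)",
    "agent.provider_chat: OpenHuman API error (502 Bad Gateway)",
    "all providers exhausted: OpenHuman API error (502 Bad Gateway)",
];
```

Because the match is a substring search rather than a prefix check, any number of outer wrappers leaves the classification unchanged as long as the provider text is rendered somewhere in the chain.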
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026

3 participants