Skip to content

fix(socket): route sustained-outage escalation through observability classifier (OPENHUMAN-TAURI-BH)#1672

Merged
senamakel merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/socket-skip-network-unreachable-bh
May 13, 2026
Merged

fix(socket): route sustained-outage escalation through observability classifier (OPENHUMAN-TAURI-BH)#1672
senamakel merged 1 commit into
tinyhumansai:mainfrom
CodeGhost21:fix/socket-skip-network-unreachable-bh

Conversation

@CodeGhost21
Copy link
Copy Markdown
Contributor

@CodeGhost21 CodeGhost21 commented May 13, 2026

Summary

  • Routes ws_loop's threshold-escalation log (introduced in fix(socket): suppress retry-storm Sentry noise + empty-token guard (OPENHUMAN-TAURI-8M) #1568 for OPENHUMAN-TAURI-8M) through crate::core::observability::report_error_or_expected so transport-level user-environment shapes (Network is unreachable, DNS error, connection refused/reset, TLS handshake) demote to a warn breadcrumb instead of firing a Sentry event.
  • Genuine outage shapes (gateway 5xx, server-side WebSocket close, malformed handshake) don't match the classifier and still fire exactly one Sentry event per affected client — preserving the OPENHUMAN-TAURI-8M intent.
  • Two regression tests pin the wire format the escalation builds to the classifier match so neither side can drift silently.

Why

OPENHUMAN-TAURI-BH: an offline Mac (os error 51) was firing the one-shot sustained-outage log::error! because the call site bypassed the classifier in src/core/observability.rs. The classifier already handles network is unreachable (added for OPENHUMAN-TAURI-32) — six other call sites (web_channel.run_chat_task, integrations.client, providers/reliable, agent.harness.tool_loop, agent.harness.runtime) already route through report_error_or_expected. The socket loop was the odd one out.

No new classifier skip rule is being added — this PR just wires the socket call site through the existing one.

Fixes OPENHUMAN-TAURI-BH

Test plan

  • cargo check --manifest-path Cargo.toml --tests — clean (only pre-existing warnings)
  • cargo test --lib openhuman::socket::ws_loop — 33 passed, 0 failed (includes 2 new tests)
  • cargo test --lib core::observability — 17 passed, 0 failed (contract intact)
  • New regression tests:
    • sustained_outage_for_network_unreachable_classifies_as_expected — pins the BH fix
    • sustained_outage_for_actionable_server_error_does_not_classify — pins the OPENHUMAN-TAURI-8M invariant (real outages still surface)

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced connection failure escalation and reporting for sustained outages with improved error routing and additional diagnostic metadata in failure notifications.
  • Tests

    • Added regression tests to validate error classification for network unreachability and server errors in connection failure handling scenarios.

Review Change Stack

…classifier (OPENHUMAN-TAURI-BH)

The one-shot `log::error!` added in OPENHUMAN-TAURI-8M (tinyhumansai#1568) correctly
collapses retry storms to a single Sentry event per affected client, but
it fires on every transport-level failure shape — including offline users
that just hit `Network is unreachable (os error 51)`. Sentry has no
signal to act on a user being offline (no status, no trace, no payload),
so each event was pure noise.

`src/core/observability.rs::expected_error_kind` already classifies the
network-unreachable substring (added for OPENHUMAN-TAURI-32), and six
other call sites — `web_channel.run_chat_task`, `integrations.client`,
`providers/reliable`, `agent.harness.tool_loop`, `agent.harness.runtime`
— already route their transport errors through `report_error_or_expected`.
The socket loop was the odd one out.

Switching the threshold branch to `report_error_or_expected` keeps the
OPENHUMAN-TAURI-8M intent (one Sentry event per genuine sustained outage,
e.g. gateway 5xx or malformed handshake) while demoting environment-only
shapes to warn-level breadcrumbs. Two regression tests pin the wire
format to the classifier match so neither side can drift silently.

Fixes OPENHUMAN-TAURI-BH
@CodeGhost21 CodeGhost21 requested a review from a team May 13, 2026 18:03
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 083b54f2-3c6c-437e-9334-d306392b8cec

📥 Commits

Reviewing files that changed from the base of the PR and between 39f641d and f547b38.

📒 Files selected for processing (2)
  • src/openhuman/socket/ws_loop.rs
  • src/openhuman/socket/ws_loop_tests.rs

📝 Walkthrough

Walkthrough

When consecutive WebSocket connection failures reach the escalation threshold, the sustained-outage notification is routed through report_error_or_expected with a formatted message and attempts metadata, replacing a direct error log. Regression tests verify correct classification behavior for network-unreachable and server-error sustained-outage messages.

Changes

Sustained-Outage Escalation Routing and Tests

Layer / File(s) Summary
Sustained-outage escalation routing
src/openhuman/socket/ws_loop.rs, src/openhuman/socket/ws_loop_tests.rs
log_connection_failure at the failure threshold now routes the sustained-outage message through crate::core::observability::report_error_or_expected with formatted text and an attempts metadata field, replacing direct logging. Two regression tests verify that network-unreachable sustained-outage messages classify as ExpectedErrorKind::NetworkUnreachable and that actionable server-error messages do not classify as expected.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

Possibly related PRs

  • tinyhumansai/openhuman#1601: Introduced the report_error_or_expected function and ExpectedErrorKind classification logic that this PR's escalation path now uses.
  • tinyhumansai/openhuman#1568: Previously modified log_connection_failure escalation/threshold behavior and ws_loop_tests.rs; this PR extends that with the observability routing change.
  • tinyhumansai/openhuman#1669: Modifies the shared expected_error_kind classifier that the sustained-outage escalation tests depend on.

Suggested labels

working

Suggested reviewers

  • senamakel

Poem

A rabbit hops through logs so deep,
Where network dreams and errors creep,
Now escalations route with care,
Through observability's classified layer—
Expected or not, we'll log it fair! 🐰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: routing the sustained-outage escalation through the observability classifier in the socket module, which is the core modification across both file changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 13, 2026
@senamakel senamakel merged commit ca0920f into tinyhumansai:main May 13, 2026
24 of 27 checks passed
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants