Skip to content

feat: opt-in fail-loud when AGENTMEMORY_URL is unreachable#273

Open
TotalZack wants to merge 2 commits into
rohitg00:mainfrom
TotalZack:feat/remote-required-no-silent-fallback
Open

feat: opt-in fail-loud when AGENTMEMORY_URL is unreachable#273
TotalZack wants to merge 2 commits into
rohitg00:mainfrom
TotalZack:feat/remote-required-no-silent-fallback

Conversation

@TotalZack
Copy link
Copy Markdown

@TotalZack TotalZack commented May 10, 2026

Problem

@agentmemory/mcp's standalone shim probes /agentmemory/livez and silently falls back to local in-memory mode on probe failure. Users on multi-machine self-hosted setups (Tailscale, Coolify) have no signal that their memories are landing in a local store rather than the configured shared backend.

Fix

Two new env vars, both default-off (no behavior change for existing users), plus a visible stderr line on the previously-silent fallback path:

Env var Default Effect
AGENTMEMORY_REMOTE_REQUIRED unset When 1 / true / yes / on: probe failure throws a structured RemoteUnreachableError instead of falling back to local. Recommended for production multi-machine deployments.
AGENTMEMORY_LIVEZ_TIMEOUT_MS 2000 ms (same default as AGENTMEMORY_PROBE_TIMEOUT_MS since v0.9.7) Override the probe timeout. Takes precedence over AGENTMEMORY_PROBE_TIMEOUT_MS so existing deployments that already set this env var continue to work unchanged.

Plus: when REMOTE_REQUIRED is unset and probe fails, the shim now writes a single line to stderr — with url, probeTimeoutMs, and a hint at both env vars.

Three operator-friendly modes

Setup Configuration
Indie dev (default, unchanged) env unset → try remote, fall back to local with a visible stderr line if probe fails
Strict production multi-machine AGENTMEMORY_REMOTE_REQUIRED=1 → throw RemoteUnreachableError if probe fails
Local-only AGENTMEMORY_URL unset or unreachable, no REMOTE_REQUIRED → local-only (existing behavior)

Changes vs v0.9.9

  • src/mcp/rest-proxy.ts: remoteRequired() helper + RemoteUnreachableError class + merged probeTimeoutMs() that checks AGENTMEMORY_LIVEZ_TIMEOUT_MS first, then AGENTMEMORY_PROBE_TIMEOUT_MS; resolveHandle() throws on probe failure when REMOTE_REQUIRED is set; fallback path emits process.stderr.write (matching upstream's v0.9.7 style) instead of logger.warn
  • test/mcp-standalone-proxy.test.ts: 3 vitest cases covering each operator mode

Compatibility

  • No behavior change for users with no env vars set
  • AGENTMEMORY_LIVEZ_TIMEOUT_MS is a transparent alias — deployments using the old name keep working
  • Rebased onto v0.9.9

Summary by CodeRabbit

  • Documentation

    • Updated changelog and README with guidance and example env settings for remote-backend resilience.
  • New Features

    • Added AGENTMEMORY_LIVEZ_TIMEOUT_MS to customize remote health-probe timeout.
    • Added AGENTMEMORY_REMOTE_REQUIRED to opt into fail‑loud behavior instead of silent local fallback.
    • Enhanced stderr warning when remote /livez is unreachable (includes timeout and env var hints).
  • Tests

    • Added tests covering timeout, fallback, and fail‑loud behaviors.

Review Change Stack

@vercel
Copy link
Copy Markdown

vercel Bot commented May 10, 2026

@TotalZack is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 72e961f5-ec12-4bf0-9676-086d37932687

📥 Commits

Reviewing files that changed from the base of the PR and between 882037e and 5dd1cd4.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • README.md
  • src/mcp/rest-proxy.ts
  • test/mcp-standalone-proxy.test.ts
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • README.md
  • test/mcp-standalone-proxy.test.ts
  • src/mcp/rest-proxy.ts

📝 Walkthrough

Walkthrough

Adds two env vars and a new error to control remote /agentmemory/livez probe behavior: AGENTMEMORY_LIVEZ_TIMEOUT_MS customizes probe timeout; AGENTMEMORY_REMOTE_REQUIRED toggles fail-loud vs. silent fallback; RemoteUnreachableError is exported. Tests and docs updated.

Changes

Remote Backend Resilience Configuration

Layer / File(s) Summary
Configuration parsing and probe timeout
src/mcp/rest-proxy.ts
Read AGENTMEMORY_LIVEZ_TIMEOUT_MS (fallback to existing probe env var), validate numeric timeout, and use AbortSignal.timeout(probeTimeoutMs()) for the /agentmemory/livez fetch; update probe error messaging to mention the new timeout env var.
RemoteUnreachableError export
src/mcp/rest-proxy.ts
Adds export class RemoteUnreachableError extends Error with public url and probeTimeoutMs fields and a message referencing AGENTMEMORY_REMOTE_REQUIRED and AGENTMEMORY_LIVEZ_TIMEOUT_MS.
AGENTMEMORY_REMOTE_REQUIRED parsing
src/mcp/rest-proxy.ts
Adds remoteRequired() interpreting truthy strings (1/true/yes/on) to enable fail-loud behavior.
resolveHandle fallback and fail-loud behavior
src/mcp/rest-proxy.ts
On probe failure, resolveHandle() throws RemoteUnreachableError(url, probeTimeoutMs()) when remote is required; otherwise logs a warning including the URL, computed timeout, and env-var hints before returning the local in-memory handle.
Tests: imports and new suite
test/mcp-standalone-proxy.test.ts
Update imports to include resolveHandle, RemoteUnreachableError, resetHandleForTests, and logger; add Vitest suite that asserts timeout aborts, silent fallback logs once, and remote-required causes RemoteUnreachableError without warning.
User-Facing Documentation
CHANGELOG.md, README.md
CHANGELOG documents the two env vars and warning behavior; README .env example adds a "Remote backend resilience" block describing AGENTMEMORY_REMOTE_REQUIRED and AGENTMEMORY_LIVEZ_TIMEOUT_MS with guidance.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

Poem

🐰 I sniff the livez gate in the night,
A timeout set to catch a slow flight.
Flip a var to shout or softly stay,
I warn, or throw, then hop to local play.
Tests, logs, and docs—resilience in sight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: introducing an opt-in fail-loud behavior controlled by AGENTMEMORY_REMOTE_REQUIRED when the remote backend is unreachable.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@TotalZack TotalZack force-pushed the feat/remote-required-no-silent-fallback branch from 7ff8b5e to 16ab04e Compare May 10, 2026 19:50
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
test/mcp-standalone-proxy.test.ts (1)

200-212: ⚡ Quick win

Strengthen the warning-contract assertion.

This test currently checks only url. Since the warning contract now includes probeTimeoutMs and env-var guidance, asserting those fields would better lock behavior.

Suggested patch
     const [msg, fields] = warn.mock.calls[0];
     expect(msg).toMatch(/agentmemory backend unreachable/i);
     expect(fields).toMatchObject({ url: BASE });
+    expect(fields).toHaveProperty("probeTimeoutMs");
+    expect((fields as { hint?: string }).hint).toContain("AGENTMEMORY_REMOTE_REQUIRED");
+    expect((fields as { hint?: string }).hint).toContain("AGENTMEMORY_LIVEZ_TIMEOUT_MS");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/mcp-standalone-proxy.test.ts` around lines 200 - 212, The test "emits a
warn on silent local fallback when REMOTE_REQUIRED is unset" should assert the
full warning contract: after spying on logger.warn and resolving handle
(resolveHandle) verify warn was called once and expand the fields assertion
(currently checking { url: BASE }) to also include probeTimeoutMs with the
expected timeout value and the env-var guidance string (e.g. mention of
REMOTE_REQUIRED or related env var); update the test's destructured fields
assertion to expect these keys so the spec locks the warning payload shape
returned by the code that emits the "agentmemory backend unreachable" message.
src/mcp/rest-proxy.ts (1)

60-66: ⚡ Quick win

Snapshot probe settings once per resolve attempt.

probeTimeoutMs() / remoteRequired() are read multiple times in one path. Capturing once keeps probe behavior, thrown error, and warning metadata consistent for that attempt.

Suggested patch
-async function probe(url: string): Promise<boolean> {
+async function probe(url: string, timeoutMs: number): Promise<boolean> {
   try {
     const res = await fetch(`${url}/agentmemory/livez`, {
       method: "GET",
       headers: authHeader(),
-      signal: AbortSignal.timeout(probeTimeoutMs()),
+      signal: AbortSignal.timeout(timeoutMs),
     });
     return res.ok;
   } catch {
     return false;
   }
 }
@@
   if (probeInFlight) return probeInFlight;
   const url = baseUrl();
+  const timeoutMs = probeTimeoutMs();
+  const requireRemote = remoteRequired();
   probeInFlight = (async () => {
-    const up = await probe(url);
+    const up = await probe(url, timeoutMs);
@@
-    if (remoteRequired()) {
-      throw new RemoteUnreachableError(url, probeTimeoutMs());
+    if (requireRemote) {
+      throw new RemoteUnreachableError(url, timeoutMs);
     }
     logger.warn(
       "agentmemory backend unreachable; falling back to in-memory local mode",
       {
         url,
-        probeTimeoutMs: probeTimeoutMs(),
+        probeTimeoutMs: timeoutMs,
         hint: "set AGENTMEMORY_REMOTE_REQUIRED=1 to fail loud, or AGENTMEMORY_LIVEZ_TIMEOUT_MS=N to allow more probe time on high-latency networks",
       },
     );

Also applies to: 89-127

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/mcp/rest-proxy.ts` around lines 60 - 66, Probe reads probeTimeoutMs() and
remoteRequired() multiple times causing inconsistent behavior; snapshot these
values once at the start of the async function probe (e.g., const timeout =
probeTimeoutMs(); const required = remoteRequired()) and use timeout and
required everywhere in probe (for AbortSignal.timeout, error messages, and
warning metadata) so the timeout and requirement decision are consistent for
that resolve attempt; apply the same snapshotting pattern to the related logic
in the other block referenced (lines 89-127) where
probeTimeoutMs()/remoteRequired() are currently read multiple times.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 13: The changelog text uses the ambiguous env var name REMOTE_REQUIRED;
update that occurrence to the full, explicit name AGENTMEMORY_REMOTE_REQUIRED in
the sentence describing the stderr warn line so the entry reads "...when
`AGENTMEMORY_URL` points at a remote backend but the `/livez` probe fails and
`AGENTMEMORY_REMOTE_REQUIRED` is unset..." to avoid operator confusion (edit the
string in CHANGELOG.md where the env var is referenced).

In `@README.md`:
- Around line 908-911: Update the README wording to reflect that fallback now
emits a warning: in the paragraph containing "fallback can land memories in the
wrong store without warning" replace that phrase with wording such as "fallback
may land memories in the wrong store after emitting a warning" (or similar) so
the docs match runtime behavior; locate this text block in the README and change
only the phrase to indicate that a warning is emitted rather than "without
warning."

In `@src/mcp/rest-proxy.ts`:
- Around line 26-27: The RemoteUnreachableError message in rest-proxy.ts is
misleading because probe() treats both timeouts/network errors and non-2xx
responses as failures; update the message constructed for RemoteUnreachableError
to be failure-mode neutral (e.g., "did not return a successful response within
{probeTimeoutMs}ms" or "did not return a 2xx response") and include the url and
probeTimeoutMs so it covers both slow/non-responsive and immediate 4xx/5xx
cases; change the string used where RemoteUnreachableError is thrown/created
(refer to RemoteUnreachableError and probe() in this file) and keep the guidance
about AGENTMEMORY_LIVEZ_TIMEOUT_MS, backend connectivity, and
AGENTMEMORY_REMOTE_REQUIRED unchanged.

---

Nitpick comments:
In `@src/mcp/rest-proxy.ts`:
- Around line 60-66: Probe reads probeTimeoutMs() and remoteRequired() multiple
times causing inconsistent behavior; snapshot these values once at the start of
the async function probe (e.g., const timeout = probeTimeoutMs(); const required
= remoteRequired()) and use timeout and required everywhere in probe (for
AbortSignal.timeout, error messages, and warning metadata) so the timeout and
requirement decision are consistent for that resolve attempt; apply the same
snapshotting pattern to the related logic in the other block referenced (lines
89-127) where probeTimeoutMs()/remoteRequired() are currently read multiple
times.

In `@test/mcp-standalone-proxy.test.ts`:
- Around line 200-212: The test "emits a warn on silent local fallback when
REMOTE_REQUIRED is unset" should assert the full warning contract: after spying
on logger.warn and resolving handle (resolveHandle) verify warn was called once
and expand the fields assertion (currently checking { url: BASE }) to also
include probeTimeoutMs with the expected timeout value and the env-var guidance
string (e.g. mention of REMOTE_REQUIRED or related env var); update the test's
destructured fields assertion to expect these keys so the spec locks the warning
payload shape returned by the code that emits the "agentmemory backend
unreachable" message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc091e64-c9fd-4f04-b0b7-8b2e6d060db2

📥 Commits

Reviewing files that changed from the base of the PR and between 13924d2 and 7ff8b5e.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • README.md
  • src/mcp/rest-proxy.ts
  • test/mcp-standalone-proxy.test.ts

Comment thread CHANGELOG.md Outdated
Comment thread README.md Outdated
Comment thread src/mcp/rest-proxy.ts Outdated
TotalZack added a commit to TotalZack/agentmemory that referenced this pull request May 10, 2026
Three small follow-ups from CodeRabbit's pre-merge review — none blocking,
all minor wording or accuracy fixes:

- CHANGELOG.md: use full env var name `AGENTMEMORY_REMOTE_REQUIRED` instead
  of the ambiguous bare `REMOTE_REQUIRED` in the warn-line bullet.
- README.md: align the inline comment about silent fallback with the
  warning-line behavior introduced in this PR ("if the warning is missed"
  rather than "without warning", since the warn line now exists).
- src/mcp/rest-proxy.ts: make `RemoteUnreachableError`'s message
  failure-mode neutral. The previous "did not respond within Xms" wording
  was inaccurate for fast non-2xx responses; the new wording covers
  timeout, network error, and non-2xx all together.

No semantic change. 12/12 vitest cases still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TotalZack <199677237+TotalZack@users.noreply.github.com>
@TotalZack
Copy link
Copy Markdown
Author

Just to clarify the Real-world validation section in the PR body — the Hetzner / Coolify / Tailscale specifics there describe one concrete dev environment that was used to reproduce both the silent-fallback bug and verify the fix end-to-end. The bug + fix apply to any self-hosted setup where the agentmemory backend sits behind a private overlay network (Tailscale, WireGuard, ZeroTier, in-cluster service mesh, etc.), where 500 ms is sometimes not enough probe headroom. The fix is platform-neutral.

Folgecommit 882037e addresses the three CodeRabbit nitpicks (full env var name in CHANGELOG, fallback wording in README, neutral wording for RemoteUnreachableError.message). Pushed as a separate commit so the original B1 work and the review-feedback fixes stay readable in history.

TotalZack and others added 2 commits May 12, 2026 12:14
The standalone MCP shim probes /agentmemory/livez with a hardcoded 500 ms
timeout and silently falls back to local in-memory mode if the probe
fails. On high-latency networks (Tailscale across regions, slow Coolify
container boots, packet loss), users who set AGENTMEMORY_URL=
http://your-server:3111/agentmemory have no signal that their memories
are landing in a local store instead of the configured backend.

Two new env vars, both default off (no behavior change for existing
users), plus a visible warn line on the previously-silent fallback path:

  AGENTMEMORY_REMOTE_REQUIRED=1     fail loud on probe failure (throws
                                    RemoteUnreachableError) instead of
                                    falling back. Recommended for prod
                                    multi-machine deployments.

  AGENTMEMORY_LIVEZ_TIMEOUT_MS=N    raise the probe timeout (ms) when
                                    the 500 ms default trips on a
                                    healthy but slow backend.

  (warn on fallback)                emit "[agentmemory] warn agentmemory
                                    backend unreachable; falling back to
                                    in-memory local mode" on stderr
                                    when the probe fails and
                                    REMOTE_REQUIRED is unset, so the
                                    failure mode stops being invisible.

Tests: three new vitest cases in test/mcp-standalone-proxy.test.ts
covering each path. Existing 7 cases untouched.

  - respects AGENTMEMORY_LIVEZ_TIMEOUT_MS — slow probe aborts at the
    configured timeout, falls back to local
  - emits a warn on silent local fallback when REMOTE_REQUIRED is unset
  - throws RemoteUnreachableError when REMOTE_REQUIRED=1 and probe fails

Docs: env-var section in root README updated with both new vars in the
existing comment-block style. CHANGELOG entry under [Unreleased] /
Added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TotalZack <199677237+TotalZack@users.noreply.github.com>
Three small follow-ups from CodeRabbit's pre-merge review — none blocking,
all minor wording or accuracy fixes:

- CHANGELOG.md: use full env var name `AGENTMEMORY_REMOTE_REQUIRED` instead
  of the ambiguous bare `REMOTE_REQUIRED` in the warn-line bullet.
- README.md: align the inline comment about silent fallback with the
  warning-line behavior introduced in this PR ("if the warning is missed"
  rather than "without warning", since the warn line now exists).
- src/mcp/rest-proxy.ts: make `RemoteUnreachableError`'s message
  failure-mode neutral. The previous "did not respond within Xms" wording
  was inaccurate for fast non-2xx responses; the new wording covers
  timeout, network error, and non-2xx all together.

No semantic change. 12/12 vitest cases still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: TotalZack <199677237+TotalZack@users.noreply.github.com>
@TotalZack TotalZack force-pushed the feat/remote-required-no-silent-fallback branch from 882037e to 5dd1cd4 Compare May 12, 2026 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant