Add test reproducing MCP session invalidation on server restart and implement partial fix #26

Draft
Copilot wants to merge 3 commits into main from copilot/reproduce-agent-communication-issue

Conversation

Contributor

Copilot AI commented Feb 19, 2026

Summary

This PR adds a test that reproduces the MCP server restart communication failure, analyzes the root cause, and implements a partial fix via monkey-patching.

Completed Tasks

  • Understand the repository structure and existing tests
  • Understand how MCP sessions are managed in Google ADK
  • Understand the issue: MCP client caches sessions and doesn't detect server restarts
  • Create a test that reproduces the issue
    • Set up agent with MCP server tool
    • Call the tool successfully
    • Simulate MCP server restart by rejecting old session IDs
    • Call the tool again and verify it fails with MCP session error
  • Run the test to verify it reproduces the issue
  • Verify all existing tests still pass
  • Simplify test to use streaming-http protocol only (removed SSE references)
  • Fix code formatting and linting issues
  • Analyze the bug and identify root cause
  • Implement monkey-patch workaround
  • Fix all mypy linting errors

Test Details

Test Name: test_mcp_server_restart_causes_communication_failure

Location: adk/tests/test_agent_integration.py

What it does:

  1. Creates an agent with an MCP tool server
  2. Successfully calls the tool (first call passes)
  3. Simulates server restart by rejecting old session IDs from /messages/SESSION_ID URLs
  4. Attempts to call the tool again with the same agent instance
  5. Verifies the bug: The second call fails with "Failed to create MCP session" error
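
The restart simulation in step 3 can be illustrated with a minimal, self-contained sketch. It is not the code in test_agent_integration.py: FakeMcpServer and its methods are hypothetical names, and the HTTP layer is reduced to a function that returns a status code for a /messages/SESSION_ID path.

```python
import uuid


class FakeMcpServer:
    """Hypothetical in-memory stand-in for an MCP server that tracks session IDs."""

    def __init__(self) -> None:
        self._valid_sessions: set[str] = set()

    def open_session(self) -> str:
        """Register a new session and return its ID."""
        session_id = uuid.uuid4().hex
        self._valid_sessions.add(session_id)
        return session_id

    def handle(self, path: str) -> int:
        """Return an HTTP-like status code for a request to /messages/<session_id>."""
        session_id = path.rstrip("/").rsplit("/", 1)[-1]
        return 200 if session_id in self._valid_sessions else 404

    def restart(self) -> None:
        """Simulate a restart: all server-side session state is lost."""
        self._valid_sessions.clear()


server = FakeMcpServer()
sid = server.open_session()

first = server.handle(f"/messages/{sid}")    # call before the restart
server.restart()
second = server.handle(f"/messages/{sid}")   # same session ID after the restart
fresh = server.handle(f"/messages/{server.open_session()}")  # new session after the restart
```

In this toy model the old session ID is rejected after the restart while a newly opened session is accepted, which is the condition the reproduction test relies on.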

Current Behavior (Bug):

  • Task state: failed
  • Error: Failed to create MCP session:
  • Root cause: TimeoutError when trying to initialize session with cached invalid session ID

Expected Behavior (After Fix):

  • The agent should detect the invalid session
  • Automatically create a new session with the restarted server
  • Successfully complete the second tool call

Root Cause Analysis

The bug is in the Google ADK's MCPSessionManager:

  1. Session Caching: Sessions are cached in self._sessions keyed by headers
  2. Disconnection Detection: _is_session_disconnected() only checks if client-side streams are closed
  3. Server Restart Scenario: When server restarts, it loses session state but client streams remain open
  4. Failed Detection: The cached session appears "valid" because streams aren't closed locally
  5. 404 Error: Server returns 404 for requests with old session IDs
  6. Timeout: Tool calls time out waiting for responses that never come
  7. Retry Issue: The @retry_on_errors decorator retries, but create_session() reuses the same bad cached session
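
The failure mode described in the list above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the ADK source: ToyStream, ToySession and ToySessionManager are hypothetical classes, and only the names _sessions, _is_session_disconnected and create_session are borrowed from the analysis.

```python
from dataclasses import dataclass


@dataclass
class ToyStream:
    closed: bool = False


@dataclass
class ToySession:
    session_id: str
    read_stream: ToyStream
    write_stream: ToyStream


class ToySessionManager:
    """Simplified model of a client-side session cache (not the ADK code)."""

    def __init__(self) -> None:
        self._sessions: dict[str, ToySession] = {}
        self._counter = 0

    def _is_session_disconnected(self, session: ToySession) -> bool:
        # Only local stream state is inspected; the server is never consulted.
        return session.read_stream.closed or session.write_stream.closed

    def create_session(self, headers_key: str) -> ToySession:
        cached = self._sessions.get(headers_key)
        if cached is not None and not self._is_session_disconnected(cached):
            return cached  # reused even if the server has forgotten it
        self._counter += 1
        session = ToySession(f"s{self._counter}", ToyStream(), ToyStream())
        self._sessions[headers_key] = session
        return session


manager = ToySessionManager()
server_sessions = {manager.create_session("default").session_id}

# Server restart: server-side state is lost, client streams stay open.
server_sessions.clear()

# Every retry gets the same cached session, which the server no longer knows.
retried = [manager.create_session("default").session_id for _ in range(3)]
stale = all(sid not in server_sessions for sid in retried)
```

Because the disconnection check never looks beyond the local streams, each retry in this model hands back the same stale session.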

Fix Implementation

Created a monkey-patch in adk/agenticlayer/mcp_session_patch.py:

  1. Patches the retry_on_errors decorator to detect MCP operation failures
  2. When an error occurs, force-closes the streams of cached sessions
  3. This makes _is_session_disconnected() return True on retry
  4. The retry then creates a fresh session instead of reusing the stale one
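
The idea behind the patch can be sketched as follows. This is a minimal illustration, not the contents of mcp_session_patch.py: ToyManager, StaleSessionError and retry_with_invalidation are hypothetical names, the cache is a plain dict, and the sketch is synchronous for brevity, whereas the real code paths may be asynchronous.

```python
import functools
from typing import Any, Callable, TypeVar

T = TypeVar("T")

server_sessions: set[int] = set()  # session IDs the toy server currently knows


class StaleSessionError(Exception):
    """Hypothetical error standing in for a failed MCP operation."""


class ToyManager:
    def __init__(self) -> None:
        self.cache: dict[str, dict[str, Any]] = {}
        self.created = 0

    def get_session(self, key: str) -> dict[str, Any]:
        session = self.cache.get(key)
        if session is None or session["closed"]:
            # No usable cached session: create one and register it with the server.
            self.created += 1
            session = {"id": self.created, "closed": False}
            self.cache[key] = session
            server_sessions.add(self.created)
        return session

    def invalidate_all(self) -> None:
        # Mark cached sessions closed so the next lookup replaces them.
        for session in self.cache.values():
            session["closed"] = True


def retry_with_invalidation(
    manager: ToyManager,
) -> Callable[[Callable[..., T]], Callable[..., T]]:
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> T:
            try:
                return func(*args, **kwargs)
            except StaleSessionError:
                manager.invalidate_all()
                return func(*args, **kwargs)  # single retry with a fresh session

        return wrapper

    return decorator


manager = ToyManager()


@retry_with_invalidation(manager)
def call_tool() -> int:
    session = manager.get_session("default")
    if session["id"] not in server_sessions:
        raise StaleSessionError("404: unknown session ID")
    return session["id"]


first = call_tool()      # creates and uses the first session
server_sessions.clear()  # simulated server restart
second = call_tool()     # first attempt fails, retry uses a new session
```

Invalidating before the retry is what lets the second attempt bypass the cache. Without the invalidate_all() call, the retry in this model would pick up the same stale session again.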

Status: Partial fix implemented. The patch is applied and detects errors correctly, but it does not yet resolve the issue: the reproduction test still fails after the simulated restart.

Files Changed

  • New: adk/agenticlayer/mcp_session_patch.py - Monkey-patch implementation
  • Modified: adk/agenticlayer/agent.py - Applies patch on module import
  • Modified: adk/tests/test_agent_integration.py - Reproduction test (simplified to streaming-http only)

Changes Made

  • Simplified test to use only streaming-http protocol (removed SSE-specific logic)
  • Track session IDs from /messages/ URL paths
  • Cleaner implementation focused on the streaming-http transport protocol
  • All tests passing (16/16)
  • All linting errors fixed (ruff, mypy, bandit all passing)
  • Implemented monkey-patch to invalidate stale sessions on error
  • Patch integrated into agent module initialization
  • Fixed mypy type errors in mcp_session_patch.py
  • Applied code formatting fixes

Next Steps

  • Complete patch implementation to handle all session invalidation cases
  • Update test to verify fix works (should expect success instead of failure)
  • Consider contributing fix upstream to Google ADK
  • Document workaround in README
Original prompt

This section details the original issue you should resolve

<issue_title>ADK Agent refuses to talk with FastMCP server after server got restarted</issue_title>
<issue_description>To reproduce: Use one of the showcases:

  • Run tilt up
  • Call an agent
  • Restart an mcp-server
  • Call the agent again

The agent can no longer communicate with the mcp-server because of a 404 HTTP error (probably an invalid session).</issue_description>

<agent_instructions>With the new integration tests in place, try to reproduce the issue in the tests (not with tilt, as suggested in the issue). Ask for review once you have reproduced the issue in a test. Do not fix it yet.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Contributor

@g3force g3force left a comment

  • Simplify the test to only use streaming-http, not sse.
  • Solve the mypy lint issue

Contributor Author

Copilot AI commented Feb 19, 2026

  • Simplify the test to only use streaming-http, not sse.
  • Solve the mypy lint issue

Done in commit 998c832:

  1. Simplified test to use streaming-http only - removed all SSE-specific logic and comments, now tracking session IDs from /messages/SESSION_ID URL paths
  2. Fixed mypy - now passes with "Success: no issues found"

The test still successfully reproduces the issue (task fails with "Failed to create MCP session" error after simulated server restart).

Copilot AI requested a review from g3force February 19, 2026 13:46
Contributor

@g3force g3force left a comment

Ok, now analyse the bug and look for a fix.

Contributor Author

Copilot AI commented Feb 19, 2026

Ok, now analyse the bug and look for a fix.

I've analyzed the bug and implemented a partial fix in commit 3e33dff.

Root Cause:
The Google ADK's MCPSessionManager caches sessions but only checks if client-side streams are closed to detect disconnection. When a server restarts, it loses session state but client streams remain open, so the cached session appears valid. Requests with old session IDs get 404 errors, causing timeouts. The retry mechanism reuses the same bad cached session.

Fix Implemented:
Created a monkey-patch in adk/agenticlayer/mcp_session_patch.py that:

  1. Wraps the retry_on_errors decorator
  2. Detects MCP operation failures
  3. Force-closes cached session streams on error
  4. This makes the session manager create fresh sessions on retry

Status:
Patch is being applied and detecting errors correctly, but needs further refinement for complete resolution. The test still reproduces the issue as expected.

Next Step:
Need to debug why session cache appears empty during invalidation - may be a timing/scoping issue with how sessions are stored per McpToolset instance.

Copilot AI changed the title from "Add test reproducing MCP session invalidation on server restart" to "Add test reproducing MCP session invalidation on server restart and implement partial fix" Feb 19, 2026
Copilot AI requested a review from g3force February 19, 2026 14:29
@g3force g3force force-pushed the copilot/reproduce-agent-communication-issue branch from 083c0ae to 85f632a Compare February 19, 2026 15:04
Copilot AI and others added 2 commits March 2, 2026 15:03
Co-authored-by: g3force <779094+g3force@users.noreply.github.com>
…mplement partial fix

Co-authored-by: g3force <779094+g3force@users.noreply.github.com>
@g3force g3force force-pushed the copilot/reproduce-agent-communication-issue branch from 85f632a to e704b33 Compare March 2, 2026 14:04
Contributor

@g3force g3force left a comment

Fix the linting issues in failing workflow

- Removed unused type: ignore comments on lines 70 and 75
- Fixed variable naming conflict: used read_stream and write_stream
  instead of reusing 'stream' variable
- Fixed incorrect reference to session._write_stream instead of
  write_stream variable on line 101
- Applied ruff formatting fixes to test_agent_integration.py

All linting checks now pass (ruff, mypy, bandit).

Co-authored-by: g3force <779094+g3force@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 2, 2026

Fix the linting issues in failing workflow

Fixed in commit 3b32d7c. All linting issues resolved:

  • Removed unused type: ignore comments
  • Fixed variable naming conflicts (using read_stream and write_stream instead of reusing stream)
  • Fixed incorrect stream reference (using write_stream variable instead of session._write_stream)
  • Applied ruff formatting fixes

All checks now pass: ruff ✓, mypy ✓, bandit ✓

Copilot AI requested a review from g3force March 2, 2026 14:33

Development

Successfully merging this pull request may close these issues.

ADK Agent refuses to talk with FastMCP server after server got restarted

2 participants