fix(ci): deflake smoke tests for Google models #7344

Merged
DOsinga merged 2 commits into main from
micn/deflake-smoke-test-google
Feb 19, 2026

@michaelneale
Collaborator

Problem

The Live Provider Tests Smoke Tests have been flaking heavily on Google models (gemini-2.5-pro and gemini-3-flash-preview), causing ~75% of all Smoke Test failures across PRs. These are all flakes — re-triggered runs pass.

The root cause is the uppercase transformation prompt:

Use the text_editor view command to read ./input.txt, then output this file's contents in UPPERCASE

Gemini models interpret "output in UPPERCASE" as a style instruction for their response rather than a transformation of specific file content. They read the file correctly but then hallucinate uppercase text instead:

  • HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...
  • HELLO WORLD. THIS IS A TEST. THE QUICK BROWN FOX...
  • INPUT.TXT-ABC123 (uppercased the filename, not the content)

Fix

Replace the uppercase transformation test with a two-file read-back test using random tokens per run (smoke-alpha-$RANDOM, smoke-bravo-$RANDOM).
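
As a rough illustration of the per-run setup described above, the sketch below writes two files that each hold a fresh token. The function name `make_smoke_files`, the file names `part-a.txt` and `part-b.txt`, and the prompt wording are assumptions made for this example; they are not taken from the project's actual test script.

```shell
#!/usr/bin/env bash
# Sketch only: function name, file names, and prompt wording are hypothetical.

# Create two files, each holding a token that is unique to this run.
make_smoke_files() {
  local dir="$1"
  TOKEN_A="smoke-alpha-$RANDOM"
  TOKEN_B="smoke-bravo-$RANDOM"
  printf '%s\n' "$TOKEN_A" > "$dir/part-a.txt"
  printf '%s\n' "$TOKEN_B" > "$dir/part-b.txt"
}

workdir="$(mktemp -d)"
make_smoke_files "$workdir"

# Hypothetical prompt: ask for a plain read-back, with no transformation.
PROMPT="Use the text_editor view command to read $workdir/part-a.txt and $workdir/part-b.txt, then reply with ONLY the contents of both files."

echo "alpha token: $TOKEN_A"
echo "bravo token: $TOKEN_B"
```

In bash, `$RANDOM` expands to an integer between 0 and 32767, so the tokens differ from run to run and a model cannot produce them without reading the files.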

This still verifies:

  1. Tool use: text_editor must be called (same grep check as before)
  2. Actual file reading — random tokens can't be guessed or hallucinated, so their presence in the output proves the model read the files

No transformation needed, just echo back. The prompt asks the model to reply with ONLY the file contents — much less ambiguous than asking for a case transformation.
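
The two checks listed above could be sketched as a single validation function. This is an illustration, not the project's script: `check_smoke_output` is a hypothetical name, the sample outputs are simulated rather than captured from a real run, and it assumes the captured agent output contains the string `text_editor` whenever the tool is called.

```shell
#!/usr/bin/env bash
# Sketch only: check_smoke_output and its arguments are hypothetical names.
# Assumption: the captured output mentions "text_editor" when the tool is used.

check_smoke_output() {
  local output="$1" token_a="$2" token_b="$3"
  # Tool use: the tool name must appear in the captured output.
  printf '%s\n' "$output" | grep -q "text_editor" || return 1
  # File reading: both per-run tokens must be echoed back verbatim.
  printf '%s\n' "$output" | grep -qF -- "$token_a" || return 1
  printf '%s\n' "$output" | grep -qF -- "$token_b" || return 1
  return 0
}

# Simulated outputs stand in for a real model run.
good="tool call: text_editor view ./part-a.txt
smoke-alpha-1234
smoke-bravo-5678"
bad="tool call: text_editor view ./part-a.txt
HELLO WORLD. THIS IS A TEST."

check_smoke_output "$good" smoke-alpha-1234 smoke-bravo-5678 && echo "good: pass"
check_smoke_output "$bad" smoke-alpha-1234 smoke-bravo-5678 || echo "bad: fail"
```

The second sample resembles the hallucinated replies quoted in the Problem section: the tool was called, but neither token appears, so the check fails.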

The uppercase transformation prompt ('output this file's contents in
UPPERCASE') was ambiguous enough that Gemini models would frequently
hallucinate uppercase text instead of uppercasing the actual file
content (e.g. 'HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...')
or uppercase the filename instead of the contents.

Replace with a two-file read-back test using random tokens per run.
This still verifies tool use (text_editor must be called) and proves
the model read the file contents (random tokens can't be guessed),
without requiring a transformation that trips up the models.
Copilot AI review requested due to automatic review settings February 19, 2026 06:02
Contributor

Copilot AI left a comment

Pull request overview

This PR deflakes the Live Provider “Smoke Tests” for Google Gemini models by replacing an ambiguous uppercase-transformation prompt with a deterministic two-file read-back prompt using per-run random tokens, ensuring the model must actually read file contents rather than follow a stylistic instruction.

Changes:

  • Replace the uppercase transformation smoke test with a two-file read-back test using unique per-run tokens.
  • Update validation to check that both random tokens are present in the model output (and that text_editor was used).

@michaelneale michaelneale requested a review from DOsinga February 19, 2026 06:20
Collaborator

@DOsinga DOsinga left a comment

nice one. just wanted to start working on this

@DOsinga DOsinga added this pull request to the merge queue Feb 19, 2026
Merged via the queue into main with commit c324cd3 Feb 19, 2026
20 checks passed
@DOsinga DOsinga deleted the micn/deflake-smoke-test-google branch February 19, 2026 09:11
jh-block added a commit that referenced this pull request Feb 19, 2026
* origin/main:
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  chore: upgrade to rmcp 0.16.0 (#7274)
  docs: add monitoring subagent activity section (#7323)
  docs: document Desktop UI recipe editing for model/provider and extensions (#7327)
  docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330)
  fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321)
  docs: add Permission Policy documentation for MCP Apps (#7325)
  update RPI plan prompt (#7326)
  docs: add CLI syntax highlighting theme customization (#7324)
  fix(cli): replace shell-based update with native Rust implementation (#7148)
  docs: rename Code Execution extension to Code Mode extension (#7316)
aharvard added a commit that referenced this pull request Feb 19, 2026
* origin/main: (29 commits)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  chore: upgrade to rmcp 0.16.0 (#7274)
  docs: add monitoring subagent activity section (#7323)
  docs: document Desktop UI recipe editing for model/provider and extensions (#7327)
  docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330)
  fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321)
  docs: add Permission Policy documentation for MCP Apps (#7325)
  update RPI plan prompt (#7326)
  docs: add CLI syntax highlighting theme customization (#7324)
  fix(cli): replace shell-based update with native Rust implementation (#7148)
  docs: rename Code Execution extension to Code Mode extension (#7316)
  docs: remove ALPHA_FEATURES flag from documentation (#7315)
  docs: escape variable syntax in recipes (#7314)
  ...

# Conflicts:
#	ui/desktop/src/components/McpApps/McpAppRenderer.tsx
#	ui/desktop/src/components/McpApps/types.ts
katzdave added a commit that referenced this pull request Feb 19, 2026
* 'main' of github.com:block/goose: (24 commits)
  Docs: claude code uses stream-json (#7358)
  Improve link confirmation modal (#7333)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  chore: upgrade to rmcp 0.16.0 (#7274)
  docs: add monitoring subagent activity section (#7323)
  docs: document Desktop UI recipe editing for model/provider and extensions (#7327)
  docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330)
  fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321)
  docs: add Permission Policy documentation for MCP Apps (#7325)
  update RPI plan prompt (#7326)
  docs: add CLI syntax highlighting theme customization (#7324)
  fix(cli): replace shell-based update with native Rust implementation (#7148)
  docs: rename Code Execution extension to Code Mode extension (#7316)
  ...
@michaelneale
Collaborator Author

@DOsinga the perfect one for an agent to solve - just told it to look through history for patterns, make a branch and try some things (and made sure it stayed true to intent of test!)

michaelneale added a commit that referenced this pull request Feb 19, 2026
* main: (46 commits)
  chore(deps): bump hono from 4.11.9 to 4.12.0 in /ui/desktop (#7369)
  Include 3rd-party license copy for JavaScript/CSS minified files (#7352)
  docs for reasoning env var (#7367)
  docs: update skills detail page to reference Goose Summon extension (#7350)
  fix(apps): restore MCP app sampling support reverted by #6933 (#7366)
  feat: TUI client of goose-acp (#7362)
  docs: agent variable (#7365)
  docs: pass env vars to shell (#7361)
  docs: update sandbox topic (#7336)
  feat: add local inference provider with llama.cpp backend and HuggingFace model management (#6933)
  Docs: claude code uses stream-json (#7358)
  Improve link confirmation modal (#7333)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  ...
tlongwell-block added a commit that referenced this pull request Feb 20, 2026
* origin/main: (62 commits)
  Docs: claude code uses stream-json (#7358)
  Improve link confirmation modal (#7333)
  fix(ci): deflake smoke tests for Google models (#7344)
  feat: add Cerebras provider support (#7339)
  fix: skip whitespace-only text blocks in Anthropic message (#7343)
  fix(goose-acp): heap allocations (#7322)
  Remove trailing space from links (#7156)
  fix: detect low balance and prompt for top up (#7166)
  feat(apps): add support for MCP apps to sample (#7039)
  Typescript SDK for ACP extension methods (#7319)
  chore: upgrade to rmcp 0.16.0 (#7274)
  docs: add monitoring subagent activity section (#7323)
  docs: document Desktop UI recipe editing for model/provider and extensions (#7327)
  docs: add CLAUDE_THINKING_BUDGET and CLAUDE_THINKING_ENABLED environm… (#7330)
  fix: display 'Code Mode' instead of 'code_execution' in CLI (#7321)
  docs: add Permission Policy documentation for MCP Apps (#7325)
  update RPI plan prompt (#7326)
  docs: add CLI syntax highlighting theme customization (#7324)
  fix(cli): replace shell-based update with native Rust implementation (#7148)
  docs: rename Code Execution extension to Code Mode extension (#7316)
  ...

# Conflicts:
#	crates/goose-server/src/main.rs