fix(ci): deflake smoke tests for Google models by michaelneale · Pull Request #7344 · block/goose

michaelneale · 2026-02-19T06:02:36Z

Problem

The "Live Provider Tests" Smoke Tests have been flaking heavily on Google models (gemini-2.5-pro and gemini-3-flash-preview), accounting for ~75% of all Smoke Test failures across PRs. These are all flakes: re-triggered runs pass.

The root cause is the uppercase transformation prompt:

Use the text_editor view command to read ./input.txt, then output this file's contents in UPPERCASE

Gemini models interpret "output in UPPERCASE" as a style instruction for their response rather than a transformation of specific file content. They read the file correctly but then hallucinate uppercase text instead:

HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...
HELLO WORLD. THIS IS A TEST. THE QUICK BROWN FOX...
INPUT.TXT-ABC123 (uppercased the filename, not the content)

Fix

Replace the uppercase transformation test with a two-file read-back test using random tokens per run (smoke-alpha-$RANDOM, smoke-bravo-$RANDOM).

This still verifies:

Tool use — text_editor must be called (same grep check as before)
Actual file reading — random tokens can't be guessed or hallucinated, so their presence in the output proves the model read the files

No transformation needed, just echo back. The prompt asks the model to reply with ONLY the file contents — much less ambiguous than asking for a case transformation.
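
For illustration, the setup described above could be sketched roughly as follows. This is a hedged sketch, not the actual contents of scripts/test_providers.sh: the file names (alpha.txt, bravo.txt), the variable names, and the exact prompt wording are assumptions; only the token prefixes and the "reply with ONLY the file contents" idea come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: file names, variable names, and prompt wording are assumptions,
# not copied from scripts/test_providers.sh.
set -e

workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# A fresh token per run, so the model cannot guess it from the prompt or from training data.
token_a="smoke-alpha-$RANDOM"
token_b="smoke-bravo-$RANDOM"

# Each file holds exactly one token.
printf '%s\n' "$token_a" > "$workdir/alpha.txt"
printf '%s\n' "$token_b" > "$workdir/bravo.txt"

# Plain echo-back request: no case change, no summarising.
prompt="Use the text_editor view command to read ./alpha.txt and ./bravo.txt, then reply with ONLY the contents of both files."

echo "$prompt"
```

Because the tokens never appear in the prompt, the only way for them to show up in the model's reply is for the model to have read both files.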

The uppercase transformation prompt ('output this file's contents in UPPERCASE') was ambiguous enough that Gemini models would frequently hallucinate uppercase text instead of uppercasing the actual file content (e.g. 'HELLO. I AM A LARGE LANGUAGE MODEL. I AM GOOSE...') or uppercase the filename instead of the contents. Replace with a two-file read-back test using random tokens per run. This still verifies tool use (text_editor must be called) and proves the model read the file contents (random tokens can't be guessed), without requiring a transformation that trips up the models.

Copilot

Pull request overview

This PR deflakes the Live Provider “Smoke Tests” for Google Gemini models by replacing an ambiguous uppercase-transformation prompt with a deterministic two-file read-back prompt using per-run random tokens, ensuring the model must actually read file contents rather than follow a stylistic instruction.

Changes:

Replace the uppercase transformation smoke test with a two-file read-back test using unique per-run tokens.
Update validation to check that both random tokens are present in the model output (and that text_editor was used).
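
A minimal sketch of the validation described in these two changes, assuming the captured run output is available as a string. The function name validate_readback and its arguments are hypothetical, and the real script's grep pattern for text_editor may differ from the plain substring match used here.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names are hypothetical; the exact log
# format that scripts/test_providers.sh greps is not shown on this PR page.

# Succeeds only if the output mentions text_editor and contains both tokens.
validate_readback() {
  local output="$1"
  local token_a="$2"
  local token_b="$3"

  # Tool use: the tool name must appear somewhere in the captured output.
  printf '%s\n' "$output" | grep -q "text_editor" || return 1

  # File reading: both per-run tokens must appear verbatim (fixed-string match).
  printf '%s\n' "$output" | grep -qF -- "$token_a" || return 1
  printf '%s\n' "$output" | grep -qF -- "$token_b" || return 1
}

good="tool call: text_editor view ./alpha.txt
smoke-alpha-123
smoke-bravo-456"

validate_readback "$good" smoke-alpha-123 smoke-bravo-456 && echo "pass"
validate_readback "HELLO. I AM GOOSE..." smoke-alpha-123 smoke-bravo-456 || echo "fail"
```

The second call shows how a hallucinated reply like the ones quoted in the PR description would be rejected: it contains neither the tool name nor the tokens.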

scripts/test_providers.sh

DOsinga

nice one. just wanted to start working on this

michaelneale · 2026-02-19T23:23:20Z

@DOsinga the perfect one for an agent to solve - just told it to look through history for patterns, make a branch and try some things (and made sure it stayed true to intent of test!)

Copilot AI review requested due to automatic review settings February 19, 2026 06:02

Copilot started reviewing on behalf of michaelneale February 19, 2026 06:03

Copilot AI reviewed Feb 19, 2026

fix: correct comment to match actual validation

2f1be0e

michaelneale requested a review from DOsinga February 19, 2026 06:20

DOsinga approved these changes Feb 19, 2026

DOsinga added this pull request to the merge queue Feb 19, 2026

Merged via the queue into main with commit c324cd3 Feb 19, 2026
20 checks passed

DOsinga deleted the micn/deflake-smoke-test-google branch February 19, 2026 09:11

DOsinga merged 2 commits into main from micn/deflake-smoke-test-google
