smoke test allow pass for flaky providers #6638
Merged
Collaborator
Author
From the conversation in Discord: maybe we close this and look into the Gemini issue, since that seems concerning.
michaelneale
approved these changes
Jan 22, 2026
Collaborator
michaelneale
left a comment
is ok for now - and will have a follow up to chase these
Contributor
Hi Zane! If the gemini3 failures are with code_execution, I have #6555, which might fix the issue, as I don't see empty-response issues with it locally.
fbalicchia
pushed a commit
to fbalicchia/goose
that referenced
this pull request
Jan 23, 2026
Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
tlongwell-block
added a commit
that referenced
this pull request
Jan 23, 2026
* origin/main:
  Fix GCP Vertex AI global endpoint support for Gemini 3 models (#6187)
  fix: macOS keychain infinite prompt loop (#6620)
  chore: reduce duplicate or unused cargo deps (#6630)
  feat: codex subscription support (#6600)
  smoke test allow pass for flaky providers (#6638)
  feat: Add built-in skill for goose documentation reference (#6534)
  Native images (#6619)
  docs: ml-based prompt injection detection (#6627)
  Strip the audience for compacting (#6646)
  chore(release): release version 1.21.0 (minor) (#6634)
  add collapsable chat nav (#6649)
  fix: capitalize Rust in CONTRIBUTING.md (#6640)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /ui/desktop (#6623)
  Vibe mcp apps (#6569)
  Add session forking capability (#5882)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /documentation (#6624)
  fix(docs): use named import for globby v13 (#6639)
  PR Code Review (#6043)
  fix(docs): use dynamic import for globby ESM module (#6636)
# Conflicts:
#   Cargo.lock
#   crates/goose-server/src/routes/session.rs
katzdave
added a commit
that referenced
this pull request
Jan 26, 2026
…o dkatz/canonical-context
* 'dkatz/canonical-provider' of github.com:block/goose: (27 commits)
  docs: add Remotion video creation tutorial (#6675)
  docs: export recipe and copy yaml (#6680)
  Test against fastmcp (#6666)
  docs: mid-session changes (#6672)
  Fix MCP elicitation deadlock and improve UX (#6650)
  chore: upgrade to rmcp 0.14.0 (#6674)
  [docs] add MCP-UI to MCP Apps blog (#6664)
  ACP get working dir from args.cwd (#6653)
  Optimise load config in UI (#6662)
  Fix GCP Vertex AI global endpoint support for Gemini 3 models (#6187)
  fix: macOS keychain infinite prompt loop (#6620)
  chore: reduce duplicate or unused cargo deps (#6630)
  feat: codex subscription support (#6600)
  smoke test allow pass for flaky providers (#6638)
  feat: Add built-in skill for goose documentation reference (#6534)
  Native images (#6619)
  docs: ml-based prompt injection detection (#6627)
  Strip the audience for compacting (#6646)
  chore(release): release version 1.21.0 (minor) (#6634)
  add collapsable chat nav (#6649)
  ...
This was referenced Jan 27, 2026
Summary
Goose created this fix. Alternatively, we could just remove the experimental models for now.
Investigation Results
I investigated the flaky smoke-tests-code-exec job using the GitHub CLI and found:
Root Cause: Two models have inconsistent tool-calling behavior:
- google:gemini-3-pro-preview: Most frequent offender (~80% of failures). Sometimes returns empty responses without making any tool calls.
- openrouter:nvidia/nemotron-3-nano-30b-a3b: Occasional failures with similar behavior.
Pattern: When these models fail, they return nothing within ~5 seconds. When they succeed, they take ~45 seconds and properly call tools. This is typical of preview/experimental models.
Timeline:
- gemini-3-pro-preview added Nov 19, 2025
- nvidia/nemotron-3-nano-30b-a3b added Dec 31, 2025
Fix Applied
I modified scripts/test_providers.sh to add an "allowed failures" mechanism:
- ALLOWED_FAILURES array listing the flaky models
- is_allowed_failure() function to check if a model is in the list
- Allowed failures are reported as ⚠ FLAKY instead of ✗ FAILED
Expected Behavior After Fix
- If gemini-3-pro-preview fails: test shows ⚠ google: gemini-3-pro-preview (flaky) and the job passes
- If any other model fails: test shows ✗ provider: model and the job fails
This approach:
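For readers who have not opened the diff, the allowed-failures mechanism described under Fix Applied could look roughly like the sketch below. This is not the actual contents of scripts/test_providers.sh: only the ALLOWED_FAILURES and is_allowed_failure names and the ⚠/✗ output lines come from the description above. The report_result helper, the "provider:model" key format, and the ✓ success marker are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an allowed-failures check; the real script may differ.

# Models whose failures should not fail the job, keyed as "provider:model"
# (the key format is an assumption).
ALLOWED_FAILURES=(
  "google:gemini-3-pro-preview"
  "openrouter:nvidia/nemotron-3-nano-30b-a3b"
)

# Return 0 if the "provider:model" key in $1 is on the allow list, 1 otherwise.
is_allowed_failure() {
  local key="$1" allowed
  for allowed in "${ALLOWED_FAILURES[@]}"; do
    [[ "$allowed" == "$key" ]] && return 0
  done
  return 1
}

# report_result is a hypothetical helper: it prints the status line for one
# test and returns non-zero only for a failure that is not on the allow list.
report_result() {
  local provider="$1" model="$2" exit_code="$3"
  if (( exit_code == 0 )); then
    echo "✓ ${provider}: ${model}"
  elif is_allowed_failure "${provider}:${model}"; then
    echo "⚠ ${provider}: ${model} (flaky)"
  else
    echo "✗ ${provider}: ${model}"
    return 1
  fi
}
```

A caller would pass each test's exit code to report_result and fail the job only if any call returns non-zero, so a flaky model is still visible in the log without turning the run red.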