
Model fallbacks #1589

Merged
dgageot merged 2 commits into docker:main from krissetto:fallback-models
Feb 5, 2026

@krissetto krissetto commented Feb 4, 2026

Allows users to define fallback models per agent in the YAML config.

If a model call fails, cagent either retries it a few times with exponential backoff and jitter, or falls back to the next model in the list, depending on the type of error encountered.
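As a rough illustration of "exponential backoff with jitter", here is a minimal Go sketch. The function name `backoffDelay`, the base delay, the cap, and the choice of keeping half the delay fixed and randomizing the other half are all assumptions made for the example; the PR text does not state which values or jitter scheme cagent uses.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns the wait before retry number attempt (0-based).
// The delay doubles per attempt up to limit; half of it is fixed and the
// other half is scaled by frac, a random value in [0, 1).
// All parameters are illustrative assumptions, not values from the PR.
func backoffDelay(attempt int, base, limit time.Duration, frac float64) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2 // exponential growth, stopped once the cap is reached
	}
	if d > limit {
		d = limit
	}
	half := d / 2
	// Jitter spreads retries out so clients do not retry in lockstep.
	return half + time.Duration(frac*float64(half))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		d := backoffDelay(attempt, 200*time.Millisecond, 2*time.Second, rand.Float64())
		fmt.Printf("retry %d: wait %v\n", attempt+1, d)
	}
}
```

The random fraction is passed in rather than drawn inside the function so the delay calculation stays deterministic and easy to test.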

This makes cagent a more reliable platform: an inference provider going down no longer blocks users' workflows.

Sane defaults keep user configs minimal while still delivering most of the benefit.

Title generation is covered as well.

Minimal example

agents:
  root:
    model: anthropic/claude-opus-4-5
    fallback:
      models:
        - openai/gpt-5.2
    description: A reliable assistant with automatic failover
    instruction: You are a helpful and resilient assistant.

@krissetto krissetto changed the title [proposal] Model fallbacks Model fallbacks Feb 4, 2026
@krissetto krissetto marked this pull request as ready for review February 4, 2026 20:10
@krissetto krissetto requested a review from a team as a code owner February 4, 2026 20:10
github-actions bot previously approved these changes Feb 4, 2026

Model Fallback Implementation Review

No issues found. This is a well-implemented feature with comprehensive testing and proper safeguards.

What Was Reviewed

This PR adds a robust model fallback system with:

  • Retry logic with exponential backoff for retryable errors (5xx, timeouts)
  • Immediate fallback switching for non-retryable errors (429 and other 4xx)
  • Cooldown mechanism to stick with successful fallbacks
  • Configurable retry counts and cooldown durations
  • Comprehensive test coverage (779 lines of tests)
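The retry-versus-fallback split described in the list above could be sketched in Go roughly as follows. The `statusError` type, the `action` names, and the decision to fall back on unrecognized errors are hypothetical choices for this example; only the 5xx/timeout versus 429/4xx split comes from the review text.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// action is what the caller should do after a failed model call.
type action int

const (
	retrySameModel action = iota
	switchToFallback
)

func (a action) String() string {
	if a == retrySameModel {
		return "retry same model"
	}
	return "switch to fallback"
}

// statusError is a hypothetical error type carrying an HTTP status code.
type statusError struct{ code int }

func (e *statusError) Error() string {
	return fmt.Sprintf("provider returned HTTP %d", e.code)
}

// classify maps an error to an action: 5xx and timeouts are retried,
// 429 and other 4xx switch models immediately. Falling back on any
// unrecognized error is an assumption of this sketch.
func classify(err error) action {
	var se *statusError
	if errors.As(err, &se) {
		if se.code >= 500 && se.code <= 599 {
			return retrySameModel
		}
		return switchToFallback
	}
	var te interface{ Timeout() bool }
	if errors.As(err, &te) && te.Timeout() {
		return retrySameModel
	}
	return switchToFallback
}

func main() {
	errs := []error{
		&statusError{code: 503},
		&statusError{code: 429},
		context.DeadlineExceeded,
	}
	for _, err := range errs {
		fmt.Printf("%v -> %v\n", err, classify(err))
	}
}
```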

Code Quality Highlights

Strengths:

  • ✅ Thread-safe cooldown state management with proper mutex usage
  • ✅ Proper resource cleanup (stream.Close() via defer)
  • ✅ Comprehensive error classification (retryable vs non-retryable)
  • ✅ Context cancellation handling throughout
  • ✅ Extensive test coverage including edge cases
  • ✅ Well-documented configuration with sensible defaults
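To make the "thread-safe cooldown state" point concrete, here is a small Go sketch of a mutex-guarded cooldown that remembers which fallback last succeeded. The `cooldown` type and its `set` and `active` methods are hypothetical names for illustration and are not taken from the PR's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldown remembers which fallback succeeded and until when to prefer it.
// Hypothetical type for illustration; not the PR's actual implementation.
type cooldown struct {
	mu    sync.Mutex
	index int       // index of the fallback that last succeeded
	until time.Time // zero value means no cooldown is active
}

// set records that fallback index succeeded and should be preferred for d.
func (c *cooldown) set(index int, d time.Duration, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.index, c.until = index, now.Add(d)
}

// active reports the preferred fallback index while the cooldown lasts.
func (c *cooldown) active(now time.Time) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if now.Before(c.until) {
		return c.index, true
	}
	return 0, false
}

func main() {
	var c cooldown
	now := time.Now()
	c.set(1, time.Minute, now)

	idx, ok := c.active(now.Add(30 * time.Second))
	fmt.Println(idx, ok) // still inside the cooldown window

	_, ok = c.active(now.Add(2 * time.Minute))
	fmt.Println(ok) // window has passed, so the primary is tried again
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the methods keeps the state machine testable without sleeping.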

Architecture:

  • Clean separation of concerns (fallback logic in dedicated file)
  • Proper use of Go idioms (defer for cleanup, mutex for thread safety)
  • Good error wrapping and logging

Verification Process

I analyzed the implementation for common bug patterns:

  • ✅ Resource leaks - None found (proper defer usage)
  • ✅ Race conditions - None found (local variables + mutex protection)
  • ✅ Bounds checking - Properly validated
  • ✅ Nil pointer dereferences - Properly guarded
  • ✅ Off-by-one errors - Calculations are correct

Conclusion

This feature adds significant value by making cagent more resilient to provider outages. The implementation is solid with proper safeguards and comprehensive testing.

@krissetto krissetto marked this pull request as draft February 4, 2026 20:43
@krissetto krissetto force-pushed the fallback-models branch 2 times, most recently from b4399bc to 4bb5ac8 Compare February 4, 2026 21:29
@krissetto krissetto marked this pull request as ready for review February 5, 2026 09:47
github-actions bot previously approved these changes Feb 5, 2026

Review Summary

I've reviewed the model fallback implementation and found no bugs in the changed code. The implementation demonstrates solid engineering practices:

  • Proper concurrency control - Mutex protection for cooldown state access
  • Defensive programming - Nil checks and bounds validation throughout
  • Error handling - Clear classification of retryable vs non-retryable errors
  • Context handling - Proper cancellation checks in retry loops
  • Code organization - Clean separation of concerns between fallback logic and model switching
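The points on error handling and context handling can be illustrated with a simplified Go retry loop that walks a model chain. Everything here is a sketch under assumptions: `callWithFallback`, `callFunc`, `errRetryable`, and the `demo` scenario are hypothetical names, and the backoff sleep between retries is omitted for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errRetryable marks errors worth retrying on the same model (hypothetical).
var errRetryable = errors.New("retryable provider error")

// callFunc is a hypothetical stand-in for a single model invocation.
type callFunc func(ctx context.Context, model string) (string, error)

// callWithFallback tries each model in order, retrying retryable errors
// up to maxRetries times and moving on immediately for other errors.
func callWithFallback(ctx context.Context, models []string, maxRetries int, call callFunc) (string, error) {
	lastErr := errors.New("no models configured")
	for _, model := range models {
		for attempt := 0; attempt <= maxRetries; attempt++ {
			if err := ctx.Err(); err != nil {
				return "", err // caller cancelled: stop retrying
			}
			out, err := call(ctx, model)
			if err == nil {
				return out, nil
			}
			lastErr = fmt.Errorf("%s: %w", model, err)
			if !errors.Is(err, errRetryable) {
				break // non-retryable: go to the next model
			}
			// A fuller version would wait here with backoff while
			// also watching ctx.Done().
		}
	}
	return "", fmt.Errorf("all models failed: %w", lastErr)
}

// demo runs a scripted scenario: the primary always fails with a
// retryable error and the fallback succeeds.
func demo() (string, int, error) {
	calls := 0
	call := func(_ context.Context, model string) (string, error) {
		calls++
		if model == "primary" {
			return "", errRetryable
		}
		return "reply from " + model, nil
	}
	out, err := callWithFallback(context.Background(), []string{"primary", "fallback"}, 2, call)
	return out, calls, err
}

func main() {
	out, calls, err := demo()
	fmt.Println(out, calls, err)
}
```

In the scripted scenario the primary is attempted three times (one call plus two retries) before the fallback answers on the fourth call.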

The feature adds robust failover capabilities with sensible defaults. Good work!

Allows users to define fallback models per agent in the yaml config.

If a model call fails, cagent either retries it a few times with exponential backoff and jitter, or falls back to the next model in the list, depending on the type of error encountered.

Signed-off-by: Christopher Petito <chrisjpetito@gmail.com>
@dgageot dgageot merged commit 3c6d330 into docker:main Feb 5, 2026
5 checks passed