
Conversation

@KrishnaShuk
Contributor

@KrishnaShuk KrishnaShuk commented Jan 20, 2026

Related Issue

Closes #43

Summary

This PR implements a robust "Smart Retry" mechanism for the Cortex CLI to improve reliability during installations by automatically handling transient failures.

  • SmartRetry Utility: Introduced cortex/utils/retry.py with exponential backoff and intelligent error classification to retry transient issues (like network timeouts) while failing fast on permanent errors (like permission denied or disk space), ensuring no time is wasted on unrecoverable failures.
  • Integration & Visibility: Integrated the retry logic into InstallationCoordinator
    and updated the status callback to print visible warnings ("⚠️ Transient error detected...") to stdout, keeping users informed of delays.
  • Configuration & Documentation: Configured the CLI to default to 5 retries and updated docs/COMMANDS.md to officially document this new reliability feature.
  • Testing: Added comprehensive unit tests in tests/test_retry.py with 100% coverage and updated existing coordinator tests to mock sleep for efficient execution.
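
As a rough illustration of the mechanism described above, a minimal standalone sketch (this is not the PR's actual cortex/utils/retry.py; the marker strings, the RuntimeError type, and backoff_factor=1.0 are assumptions for the example):

```python
import time

# Illustrative marker lists; the real implementation classifies errors
# via ErrorParser rather than substring matching.
TRANSIENT_MARKERS = ("timeout", "temporary failure", "could not resolve")
PERMANENT_MARKERS = ("permission denied", "no space left")

def run_with_retry(func, max_retries: int = 5, backoff_factor: float = 1.0):
    """Retry func() on transient errors with exponential backoff; fail fast otherwise."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except RuntimeError as e:
            msg = str(e).lower()
            if any(marker in msg for marker in PERMANENT_MARKERS):
                raise  # permanent error: no retries, no wasted time
            last_exc = e
            if attempt < max_retries:
                # Exponential backoff: 1x, 2x, 4x, ... the base factor.
                time.sleep(backoff_factor * (2 ** (attempt - 1)))
    raise last_exc
```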

AI Disclosure

  • No AI used
  • AI/IDE/Agents used (please describe below)
    Google Antigravity AI coding assistant (Claude Opus 4.5) was used to frame better tests and create the command file.

Demonstration

I manipulated coordinator.py to produce the behaviour shown below.

Screencast.from.2026-01-22.12-59-37.webm

Checklist

  • PR title follows format: type(scope): description or [scope] description
  • Tests pass (pytest tests/)
  • MVP label added if closing MVP issue
  • Update "Cortex -h" (if needed)

Summary by CodeRabbit

  • New Features
    • Installation now automatically retries transient failures with exponential backoff and progress/status callbacks (configurable, default max attempts: 5); permanent errors fail fast.
  • Tests
    • Added tests covering success, transient and permanent failures, exceeded retries, exception-based retries, and status-callback notifications.
  • Documentation
    • Install docs updated to describe the smart retry behavior and retry limits.


@coderabbitai
Contributor

coderabbitai bot commented Jan 20, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a SmartRetry utility and integrates retry-enabled command execution into InstallationCoordinator (new max_retries parameter, default 5). CLI passes max_retries=5 for sequential installs. Tests and docs updated; tests patch time.sleep.

Changes

Cohort / File(s) Summary
Retry utility
cortex/utils/retry.py
New SmartRetry class implementing configurable retries with exponential backoff, error classification (_should_retry), status callbacks, last_result/last_exception tracking, and logging.
Coordinator core
cortex/coordinator.py
InstallationCoordinator.__init__ and from_plan gain max_retries; _execute_command now uses SmartRetry.run(...) (replaces single subprocess.run) while retaining timeout, rollback, and progress callbacks.
CLI call site
cortex/cli.py
Sequential installation path constructs InstallationCoordinator(..., progress_callback=progress_callback, max_retries=5).
Tests
tests/test_retry.py, tests/test_coordinator.py
New tests/test_retry.py validates retry scenarios; tests/test_coordinator.py patches time.sleep, updates test signatures to accept the mock, and adjusts one expectation.
Docs
docs/COMMANDS.md
Added brief note describing Smart Retry Logic with exponential backoff (up to 5 attempts).

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (cortex/cli.py)
    participant Coordinator as InstallationCoordinator
    participant SmartRetry as SmartRetry
    participant Subprocess as subprocess.run
    participant Logger as Logger

    CLI->>Coordinator: instantiate(..., max_retries=5)
    Coordinator->>SmartRetry: create(max_retries, backoff_factor, status_callback)

    loop for each command
        Coordinator->>SmartRetry: run(execute_command)
        SmartRetry->>Subprocess: execute command (subprocess.run)
        alt returncode == 0
            Subprocess-->>SmartRetry: success result
            SmartRetry-->>Coordinator: return result
        else transient error (_should_retry → true)
            Subprocess-->>SmartRetry: failed result
            SmartRetry->>Logger: log retry attempt / status_callback
            SmartRetry->>SmartRetry: sleep(backoff_factor * 2^(attempt-1))
            SmartRetry->>Subprocess: retry execute command
        else permanent error (_should_retry → false)
            Subprocess-->>SmartRetry: failed result
            SmartRetry-->>Coordinator: return failed result
        end
    end
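
The sleep step in the diagram implies the following delay schedule (a quick sketch; backoff_factor=1.0 is an assumed value, not one confirmed by the PR):

```python
def backoff_delays(max_retries: int, backoff_factor: float) -> list[float]:
    """Delay before each retry attempt: backoff_factor * 2^(attempt-1)."""
    return [backoff_factor * (2 ** (a - 1)) for a in range(1, max_retries + 1)]

# With max_retries=5 and backoff_factor=1.0, delays double each attempt:
# 1.0, 2.0, 4.0, 8.0, 16.0 seconds.
```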

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 I hop through logs and backoff's beat,

one, two, then four—I won't accept defeat.
Transient storms I bravely retry,
Permanent woes I kindly pass by.
✨—A rabbit's cheer for resilient installs

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Implement Smart Retry Logic with Exponential Backoff for Installations' directly and clearly summarizes the main change in the PR, following conventional commit format.
Description check ✅ Passed The PR description includes all required sections from the template: Related Issue (#43), comprehensive Summary, AI Disclosure, and completed Checklist items. Content is detailed and complete.
Linked Issues check ✅ Passed The PR fully addresses all acceptance criteria from issue #43: transient vs permanent failure detection via SmartRetry class, exponential backoff implementation (backoff_factor * 2^(attempt-1)), configurable max_retries (default 5), retry logging, error type differentiation, user-visible warnings (status callbacks), and comprehensive unit tests.
Out of Scope Changes check ✅ Passed All changes directly support the Smart Retry feature: SmartRetry utility, InstallationCoordinator integration, test coverage, and documentation updates. No unrelated modifications detected.



@gemini-code-assist
Contributor

Summary of Changes

Hello @KrishnaShuk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request significantly enhances the robustness of the installation process by introducing a smart retry mechanism with exponential backoff. This system is designed to automatically re-attempt installation commands that fail due to temporary issues, such as network glitches or resource contention, while immediately failing on permanent errors. This will lead to more reliable and self-healing installations, reducing manual intervention.

Highlights

  • Introduced Smart Retry Logic: A new SmartRetry class has been implemented to handle transient errors during installations with exponential backoff.
  • Integrated into Installation Coordinator: The InstallationCoordinator now utilizes the SmartRetry mechanism for executing commands, making installation steps more resilient.
  • Configurable Retries: The maximum number of retries (max_retries) is now a configurable parameter for the InstallationCoordinator.
  • Error Classification: The SmartRetry class leverages an ErrorParser to intelligently distinguish between transient errors (which are retried) and permanent errors (which cause immediate failure).
  • New Unit Tests: Dedicated unit tests for the SmartRetry class have been added to ensure its correct functionality and error handling.

@KrishnaShuk
Contributor Author

Working on it!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust smart retry mechanism with exponential backoff for installation processes, significantly improving the resilience of command execution. A new SmartRetry class has been added, which intelligently distinguishes between transient and permanent errors using ErrorParser to decide whether to retry. The InstallationCoordinator has been updated to integrate this new retry logic, and corresponding unit tests for SmartRetry have been added. While the core retry logic is well-implemented, some existing tests for InstallationCoordinator need to be updated to fully reflect and verify the new retry behavior. Additionally, the handling of subprocess.TimeoutExpired could be more explicit within the retry logic to prevent unnecessary retries.

@Anshgrover23 Anshgrover23 marked this pull request as ready for review January 20, 2026 08:17
Copilot AI review requested due to automatic review settings January 20, 2026 08:17
@github-actions

CLA Verification Passed

All contributors have signed the CLA.

Contributor Signed As
@KrishnaShuk @KrishnaShuk
@Anshgrover23 @Anshgrover23

Contributor

Copilot AI left a comment


Pull request overview

This PR implements smart retry logic with exponential backoff for installation commands to handle transient errors more gracefully. The implementation adds a new SmartRetry class that uses error classification to distinguish between retryable transient errors (network issues, locks) and permanent failures (permission denied, missing packages).

Changes:

  • Added SmartRetry class with exponential backoff and error categorization
  • Integrated retry logic into InstallationCoordinator for command execution
  • Added comprehensive test coverage for retry scenarios
  • Added max_retries parameter to coordinator initialization

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cortex/utils/retry.py New retry handler with exponential backoff and error classification
cortex/coordinator.py Integration of retry logic into command execution pipeline
tests/test_retry.py Comprehensive test suite for retry behavior
tests/test_coordinator.py Updated tests to mock time.sleep for retry integration
cortex/cli.py Added max_retries parameter to coordinator instantiation


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cortex/cli.py (1)

1037-1044: Add --max-retries CLI flag to make retry attempts user-configurable.

The CLI hard-codes max_retries=5 when instantiating InstallationCoordinator, and there is no existing --max-retries flag or configuration mechanism to override it. Users cannot customize retry behavior for installation steps. Add a --max-retries argument to install_parser (default 5) and wire it through the install method to InstallationCoordinator.

🛠️ Suggested wiring
-                coordinator = InstallationCoordinator(
+                coordinator = InstallationCoordinator(
                     commands=commands,
                     descriptions=[f"Step {i + 1}" for i in range(len(commands))],
                     timeout=300,
                     stop_on_error=True,
                     progress_callback=progress_callback,
-                    max_retries=5,
+                    max_retries=max_retries,
                 )

Update install method signature:

     def install(
         self,
         software: str,
         execute: bool = False,
         dry_run: bool = False,
         parallel: bool = False,
+        max_retries: int = 5,
         json_output: bool = False,
     ):

Add to install_parser:

+install_parser.add_argument(
+    "--max-retries",
+    type=int,
+    default=5,
+    help="Maximum retry attempts for installation steps (default: 5)",
+)

Update dispatch call:

         elif args.command == "install":
             return cli.install(
                 args.software,
                 execute=args.execute,
                 dry_run=args.dry_run,
                 parallel=args.parallel,
+                max_retries=args.max_retries,
             )
🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 191-199: status_callback currently only calls self._log (which
writes to optional log_file) so retry updates are silent in the default CLI;
modify status_callback used for SmartRetry to also emit CLI-visible output
(e.g., call an existing CLI progress callback or print to stdout) so retry
progress is visible to users. Locate status_callback and update it to call both
self._log(msg) and the CLI-facing notifier (e.g., self._progress_callback(msg)
or a console print) before passing it into SmartRetry.

In `@cortex/utils/retry.py`:
- Around line 17-26: Validate inputs in the __init__ method: ensure max_retries
is an int >= 0 and backoff_factor is a float >= 0.0, and raise a clear
ValueError if either is negative or of the wrong type to fail fast (this
prevents negative durations passed to time.sleep later). Set self.max_retries
and self.backoff_factor only after validation; keep self.status_callback and
self.error_parser initialization unchanged.
- Around line 94-115: The current logic retries by default for unlisted
categories causing permanent errors (e.g., DISK_SPACE) to be retried; update the
retry decision in the function that calls error_parser.parse_error (uses
analysis = self.error_parser.parse_error(error_message) and reads
analysis.primary_category) to treat non-fixable errors as permanent: first check
analysis.is_fixable and immediately return False when is_fixable is False, then
keep the existing retry-on-transient list (ErrorCategory.NETWORK_ERROR,
LOCK_ERROR, UNKNOWN) and the permanent list (add DISK_SPACE plus other permanent
categories like PERMISSION_DENIED, PACKAGE_NOT_FOUND, CONFIGURATION_ERROR,
DEPENDENCY_MISSING, CONFLICT). This ensures permanent categories are not retried
while preserving current transient retry behavior.
- Around line 63-66: In the except block that sets last_exception and calls
self._should_retry (the block capturing "except Exception as e"), replace "raise
e" with a bare "raise" so the original traceback is preserved; locate the
handler that assigns last_exception and invokes self._should_retry(str(e)) and
change the re-raise to a bare raise to keep the original stack trace for
debugging.
🧹 Nitpick comments (1)
tests/test_retry.py (1)

9-13: Add explicit return type hints on test methods.
Type hints are required in Python files; please add -> None to setUp and each test_* method to keep the new test module compliant. As per coding guidelines, type hints are required in Python code.

♻️ Suggested update
-    def setUp(self):
+    def setUp(self) -> None:
         self.retry = SmartRetry(max_retries=3, backoff_factor=0.01)

-    def test_success_first_try(self):
+    def test_success_first_try(self) -> None:

Collaborator

@Anshgrover23 Anshgrover23 left a comment


@KrishnaShuk Kindly address all CodeRabbit comments.

@KrishnaShuk
Contributor Author

Will do it within a day.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 182-194: Add explicit type annotations to the inner helpers:
annotate run_cmd with a return type of subprocess.CompletedProcess[str] (i.e.,
def run_cmd() -> subprocess.CompletedProcess[str]:) and annotate status_callback
to return None (i.e., def status_callback(msg: str) -> None:). Ensure
subprocess.CompletedProcess is referenced via the subprocess module (or
imported) so the annotation resolves.
♻️ Duplicate comments (1)
cortex/utils/retry.py (1)

68-88: Preserve the original traceback when re-raising after retries.
Raising last_exception at the end loses the original traceback, which makes debugging harder after multiple retries. Consider storing exc_info and re-raising with its original traceback.

🛠️ Proposed fix
+import sys
@@
-        last_exception = None
+        last_exc_info = None
         last_result = None
@@
-            except Exception as e:
-                last_exception = e
+            except Exception as e:
+                last_exc_info = sys.exc_info()
                 if not self._should_retry(str(e)):
                     raise
@@
-        if last_exception:
-            raise last_exception
+        if last_exc_info:
+            _, exc, tb = last_exc_info
+            raise exc.with_traceback(tb)
         return last_result
🧹 Nitpick comments (1)
cortex/coordinator.py (1)

55-66: Document the new max_retries parameter in public docstrings.
Both __init__ and from_plan are public entry points; updating their docstrings keeps the API change discoverable.

Also applies to: 86-98

@KrishnaShuk
Contributor Author

PR passes all acceptance criteria.
Resolved all comments.
Added demonstration video.
Passing all tests.

@Anshgrover23 PR is ready to be merged!

@sonarqubecloud
Collaborator

@Anshgrover23 Anshgrover23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KrishnaShuk I've added some comments. Also, tests for the ErrorCategory integration are missing; please add them.

Also, two things are missing from requirements:

  • Different strategies for different error types
  • Documentation with configuration examples

from collections.abc import Callable
from typing import Any

from cortex.error_parser import ErrorCategory, ErrorParser
Collaborator


The ErrorParser class is unused; remove it.

timeout=300,
stop_on_error=True,
progress_callback=progress_callback,
max_retries=5,
Collaborator


Do not use magic numbers; add a constant instead, e.g. DEFAULT_MAX_RETRIES = 5.

Comment on lines +95 to +97
if not error_message:
# If no error message, assume it's a generic failure that might be transient
return True
Collaborator


This is risky. A command could fail with returncode != 0 and empty stderr/stdout for permanent reasons. Consider logging a warning here.
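
One way to keep the permissive default while surfacing the risk, as a sketch (the logger name and the substring check below are placeholders, not the PR's real classification logic):

```python
import logging

logger = logging.getLogger("cortex.retry")  # assumed logger name

def classify_empty_failure(error_message: str) -> bool:
    """Still retry on empty output, but log a warning so silent permanent
    failures remain visible in the logs."""
    if not error_message:
        logger.warning(
            "Command failed with no stderr/stdout; assuming transient and retrying."
        )
        return True
    # Placeholder for the real ErrorParser-based classification.
    return "timeout" in error_message.lower()
```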

Comment on lines +60 to +63
if hasattr(result, "stderr") and result.stderr:
error_msg = result.stderr
elif hasattr(result, "stdout") and result.stdout:
error_msg = result.stdout
Collaborator


Stdout may contain success output rather than error info, which could cause false retry classification. Please fix this.

attempt += 1
sleep_time = self.backoff_factor * (2 ** (attempt - 1))

msg = f"⚠️ Transient error detected. Retrying in {sleep_time}s... (Attempt {attempt}/{self.max_retries})"
Collaborator


On first retry, attempt=1, so message shows "Attempt 1/5" but it's actually the 2nd execution. Consider clarifying: "Retry 1/5" instead.
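
A possible wording fix, counting retries rather than attempts (a sketch; the helper name is invented for illustration):

```python
def format_retry_message(retry_number: int, max_retries: int, sleep_time: float) -> str:
    """Label the wait as 'Retry n/max' so the count matches re-executions,
    not total attempts, avoiding the off-by-one confusion."""
    return (
        f"⚠️ Transient error detected. Retrying in {sleep_time}s... "
        f"(Retry {retry_number}/{max_retries})"
    )
```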

Comment on lines +190 to +193
def status_callback(msg: str) -> None:
self._log(msg)
# Also print to stdout so the user sees the retry happening
print(msg)
Collaborator


If progress_callback is also set, user may see duplicate output. Consider checking if a callback exists before printing.

):
if not isinstance(max_retries, int) or max_retries < 0:
raise ValueError("max_retries must be a non-negative integer")
if not isinstance(backoff_factor, (int, float)) or backoff_factor < 0:
Collaborator


Should reject backoff_factor=0 as well since it defeats the purpose of backoff.
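
A sketch of the stricter validation being suggested, rejecting zero as well as negative values (illustrative only, not the PR's exact code):

```python
def validate_retry_config(max_retries: int, backoff_factor: float) -> None:
    """Reject configs that would break backoff: negative retries,
    or a non-positive backoff factor (zero defeats the purpose)."""
    if not isinstance(max_retries, int) or max_retries < 0:
        raise ValueError("max_retries must be a non-negative integer")
    if not isinstance(backoff_factor, (int, float)) or backoff_factor <= 0:
        raise ValueError("backoff_factor must be a positive number")
```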



Development

Successfully merging this pull request may close these issues.

Smart Retry Logic with Exponential Backoff

2 participants