
Conversation

@KrishnaShuk
Contributor

@KrishnaShuk KrishnaShuk commented Jan 20, 2026

Related Issue

Closes #43

Summary

This PR implements a robust "Smart Retry" mechanism for the Cortex CLI to improve reliability during installations by automatically handling transient failures.

  • SmartRetry Utility: Introduced cortex/utils/retry.py with exponential backoff and intelligent error classification to retry transient issues (like network timeouts) while failing fast on permanent errors (like permission denied or disk space), ensuring no time is wasted on unrecoverable failures.
  • Integration & Visibility: Integrated the retry logic into InstallationCoordinator
    and updated the status callback to print visible warnings ("⚠️ Transient error detected...") to stdout, keeping users informed of delays.
  • Configuration & Documentation: Configured the CLI to default to 5 retries and updated docs/COMMANDS.md to officially document this new reliability feature.
  • Testing: Added comprehensive unit tests in tests/test_retry.py with 100% coverage and updated existing coordinator tests to mock sleep for efficient execution.
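
As a rough illustration of the mechanism described above, a minimal standalone sketch (this is not the PR's actual cortex/utils/retry.py; the marker strings, the RuntimeError type, and backoff_factor=1.0 are assumptions for the example):

```python
import time

# Illustrative marker lists; the real implementation classifies errors
# via ErrorParser rather than substring matching.
TRANSIENT_MARKERS = ("timeout", "temporary failure", "could not resolve")
PERMANENT_MARKERS = ("permission denied", "no space left")

def run_with_retry(func, max_retries: int = 5, backoff_factor: float = 1.0):
    """Retry func() on transient errors with exponential backoff; fail fast otherwise."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except RuntimeError as e:
            msg = str(e).lower()
            if any(marker in msg for marker in PERMANENT_MARKERS):
                raise  # permanent error: no retries, no wasted time
            last_exc = e
            if attempt < max_retries:
                # Exponential backoff: 1x, 2x, 4x, ... the base factor.
                time.sleep(backoff_factor * (2 ** (attempt - 1)))
    raise last_exc
```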

AI Disclosure

  • No AI used
  • AI/IDE/Agents used (please describe below)
    Google Antigravity AI coding assistant (Claude Opus 4.5) was used to frame better tests and create the command file.

Demonstration

I manipulated coordinator.py to produce the behaviour shown below.

Screencast.from.2026-01-22.12-59-37.webm

Checklist

  • PR title follows format: type(scope): description or [scope] description
  • Tests pass (pytest tests/)
  • MVP label added if closing MVP issue
  • Update "Cortex -h" (if needed)

Summary by CodeRabbit

  • New Features
    • Installation now automatically retries transient failures with exponential backoff and progress/status callbacks (configurable, default max attempts: 5); permanent errors fail fast.
  • Tests
    • Added tests covering success, transient and permanent failures, exceeded retries, exception-based retries, and status-callback notifications.
  • Documentation
    • Install docs updated to describe the smart retry behavior and retry limits.


@coderabbitai
Contributor

coderabbitai bot commented Jan 20, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a SmartRetry utility and integrates retry-enabled command execution into InstallationCoordinator (new max_retries parameter, default 5). CLI passes max_retries=5 for sequential installs. Tests and docs updated; tests patch time.sleep.

Changes

Cohort / File(s) Summary
Retry utility
cortex/utils/retry.py
New SmartRetry class implementing configurable retries with exponential backoff, error classification (_should_retry), status callbacks, last_result/last_exception tracking, and logging.
Coordinator core
cortex/coordinator.py
InstallationCoordinator.__init__ and from_plan gain max_retries; _execute_command now uses SmartRetry.run(...) (replaces single subprocess.run) while retaining timeout, rollback, and progress callbacks.
CLI call site
cortex/cli.py
Sequential installation path constructs InstallationCoordinator(..., progress_callback=progress_callback, max_retries=5).
Tests
tests/test_retry.py, tests/test_coordinator.py
New tests/test_retry.py validates retry scenarios; tests/test_coordinator.py patches time.sleep, updates test signatures to accept the mock, and adjusts one expectation.
Docs
docs/COMMANDS.md
Added brief note describing Smart Retry Logic with exponential backoff (up to 5 attempts).

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (cortex/cli.py)
    participant Coordinator as InstallationCoordinator
    participant SmartRetry as SmartRetry
    participant Subprocess as subprocess.run
    participant Logger as Logger

    CLI->>Coordinator: instantiate(..., max_retries=5)
    Coordinator->>SmartRetry: create(max_retries, backoff_factor, status_callback)

    loop for each command
        Coordinator->>SmartRetry: run(execute_command)
        SmartRetry->>Subprocess: execute command (subprocess.run)
        alt returncode == 0
            Subprocess-->>SmartRetry: success result
            SmartRetry-->>Coordinator: return result
        else transient error (_should_retry → true)
            Subprocess-->>SmartRetry: failed result
            SmartRetry->>Logger: log retry attempt / status_callback
            SmartRetry->>SmartRetry: sleep(backoff_factor * 2^(attempt-1))
            SmartRetry->>Subprocess: retry execute command
        else permanent error (_should_retry → false)
            Subprocess-->>SmartRetry: failed result
            SmartRetry-->>Coordinator: return failed result
        end
    end
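
The sleep step in the diagram implies the following delay schedule (a quick sketch; backoff_factor=1.0 is an assumed value, not one confirmed by the PR):

```python
def backoff_delays(max_retries: int, backoff_factor: float) -> list[float]:
    """Delay before each retry attempt: backoff_factor * 2^(attempt-1)."""
    return [backoff_factor * (2 ** (a - 1)) for a in range(1, max_retries + 1)]

# With max_retries=5 and backoff_factor=1.0, delays double each attempt:
# 1.0, 2.0, 4.0, 8.0, 16.0 seconds.
```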

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 I hop through logs and backoff's beat,

one, two, then four—I won't accept defeat.
Transient storms I bravely retry,
Permanent woes I kindly pass by.
✨—A rabbit's cheer for resilient installs

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Implement Smart Retry Logic with Exponential Backoff for Installations' directly and clearly summarizes the main change in the PR, following conventional commit format.
Description check ✅ Passed The PR description includes all required sections from the template: Related Issue (#43), comprehensive Summary, AI Disclosure, and completed Checklist items. Content is detailed and complete.
Linked Issues check ✅ Passed The PR fully addresses all acceptance criteria from issue #43: transient vs permanent failure detection via SmartRetry class, exponential backoff implementation (backoff_factor * 2^(attempt-1)), configurable max_retries (default 5), retry logging, error type differentiation, user-visible warnings (status callbacks), and comprehensive unit tests.
Out of Scope Changes check ✅ Passed All changes directly support the Smart Retry feature: SmartRetry utility, InstallationCoordinator integration, test coverage, and documentation updates. No unrelated modifications detected.



@gemini-code-assist
Contributor

Summary of Changes

Hello @KrishnaShuk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request significantly enhances the robustness of the installation process by introducing a smart retry mechanism with exponential backoff. This system is designed to automatically re-attempt installation commands that fail due to temporary issues, such as network glitches or resource contention, while immediately failing on permanent errors. This will lead to more reliable and self-healing installations, reducing manual intervention.

Highlights

  • Introduced Smart Retry Logic: A new SmartRetry class has been implemented to handle transient errors during installations with exponential backoff.
  • Integrated into Installation Coordinator: The InstallationCoordinator now utilizes the SmartRetry mechanism for executing commands, making installation steps more resilient.
  • Configurable Retries: The maximum number of retries (max_retries) is now a configurable parameter for the InstallationCoordinator.
  • Error Classification: The SmartRetry class leverages an ErrorParser to intelligently distinguish between transient errors (which are retried) and permanent errors (which cause immediate failure).
  • New Unit Tests: Dedicated unit tests for the SmartRetry class have been added to ensure its correct functionality and error handling.

@KrishnaShuk
Contributor Author

Working on it!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust smart retry mechanism with exponential backoff for installation processes, significantly improving the resilience of command execution. A new SmartRetry class has been added, which intelligently distinguishes between transient and permanent errors using ErrorParser to decide whether to retry. The InstallationCoordinator has been updated to integrate this new retry logic, and corresponding unit tests for SmartRetry have been added. While the core retry logic is well-implemented, some existing tests for InstallationCoordinator need to be updated to fully reflect and verify the new retry behavior. Additionally, the handling of subprocess.TimeoutExpired could be more explicit within the retry logic to prevent unnecessary retries.

@Anshgrover23 Anshgrover23 marked this pull request as ready for review January 20, 2026 08:17
Copilot AI review requested due to automatic review settings January 20, 2026 08:17
@github-actions

CLA Verification Passed

All contributors have signed the CLA.

Contributor Signed As
@KrishnaShuk @KrishnaShuk
@Anshgrover23 @Anshgrover23

Contributor

Copilot AI left a comment


Pull request overview

This PR implements smart retry logic with exponential backoff for installation commands to handle transient errors more gracefully. The implementation adds a new SmartRetry class that uses error classification to distinguish between retryable transient errors (network issues, locks) and permanent failures (permission denied, missing packages).

Changes:

  • Added SmartRetry class with exponential backoff and error categorization
  • Integrated retry logic into InstallationCoordinator for command execution
  • Added comprehensive test coverage for retry scenarios
  • Added max_retries parameter to coordinator initialization

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cortex/utils/retry.py New retry handler with exponential backoff and error classification
cortex/coordinator.py Integration of retry logic into command execution pipeline
tests/test_retry.py Comprehensive test suite for retry behavior
tests/test_coordinator.py Updated tests to mock time.sleep for retry integration
cortex/cli.py Added max_retries parameter to coordinator instantiation


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cortex/cli.py (1)

1037-1044: Add --max-retries CLI flag to make retry attempts user-configurable.

The CLI hard-codes max_retries=5 when instantiating InstallationCoordinator, and there is no existing --max-retries flag or configuration mechanism to override it. Users cannot customize retry behavior for installation steps. Add a --max-retries argument to install_parser (default 5) and wire it through the install method to InstallationCoordinator.

🛠️ Suggested wiring
-                coordinator = InstallationCoordinator(
+                coordinator = InstallationCoordinator(
                     commands=commands,
                     descriptions=[f"Step {i + 1}" for i in range(len(commands))],
                     timeout=300,
                     stop_on_error=True,
                     progress_callback=progress_callback,
-                    max_retries=5,
+                    max_retries=max_retries,
                 )

Update install method signature:

     def install(
         self,
         software: str,
         execute: bool = False,
         dry_run: bool = False,
         parallel: bool = False,
+        max_retries: int = 5,
         json_output: bool = False,
     ):

Add to install_parser:

+install_parser.add_argument(
+    "--max-retries",
+    type=int,
+    default=5,
+    help="Maximum retry attempts for installation steps (default: 5)",
+)

Update dispatch call:

         elif args.command == "install":
             return cli.install(
                 args.software,
                 execute=args.execute,
                 dry_run=args.dry_run,
                 parallel=args.parallel,
+                max_retries=args.max_retries,
             )
🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 191-199: status_callback currently only calls self._log (which
writes to optional log_file) so retry updates are silent in the default CLI;
modify status_callback used for SmartRetry to also emit CLI-visible output
(e.g., call an existing CLI progress callback or print to stdout) so retry
progress is visible to users. Locate status_callback and update it to call both
self._log(msg) and the CLI-facing notifier (e.g., self._progress_callback(msg)
or a console print) before passing it into SmartRetry.

In `@cortex/utils/retry.py`:
- Around line 17-26: Validate inputs in the __init__ method: ensure max_retries
is an int >= 0 and backoff_factor is a float >= 0.0, and raise a clear
ValueError if either is negative or of the wrong type to fail fast (this
prevents negative durations passed to time.sleep later). Set self.max_retries
and self.backoff_factor only after validation; keep self.status_callback and
self.error_parser initialization unchanged.
- Around line 94-115: The current logic retries by default for unlisted
categories causing permanent errors (e.g., DISK_SPACE) to be retried; update the
retry decision in the function that calls error_parser.parse_error (uses
analysis = self.error_parser.parse_error(error_message) and reads
analysis.primary_category) to treat non-fixable errors as permanent: first check
analysis.is_fixable and immediately return False when is_fixable is False, then
keep the existing retry-on-transient list (ErrorCategory.NETWORK_ERROR,
LOCK_ERROR, UNKNOWN) and the permanent list (add DISK_SPACE plus other permanent
categories like PERMISSION_DENIED, PACKAGE_NOT_FOUND, CONFIGURATION_ERROR,
DEPENDENCY_MISSING, CONFLICT). This ensures permanent categories are not retried
while preserving current transient retry behavior.
- Around line 63-66: In the except block that sets last_exception and calls
self._should_retry (the block capturing "except Exception as e"), replace "raise
e" with a bare "raise" so the original traceback is preserved; locate the
handler that assigns last_exception and invokes self._should_retry(str(e)) and
change the re-raise to a bare raise to keep the original stack trace for
debugging.
🧹 Nitpick comments (1)
tests/test_retry.py (1)

9-13: Add explicit return type hints on test methods.
Type hints are required in Python files; please add -> None to setUp and each test_* method to keep the new test module compliant. As per coding guidelines, type hints are required in Python code.

♻️ Suggested update
-    def setUp(self):
+    def setUp(self) -> None:
         self.retry = SmartRetry(max_retries=3, backoff_factor=0.01)

-    def test_success_first_try(self):
+    def test_success_first_try(self) -> None:

Collaborator

@Anshgrover23 Anshgrover23 left a comment


@KrishnaShuk Kindly address all CodeRabbit comments.

@KrishnaShuk
Contributor Author

Will do it within a day.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@cortex/coordinator.py`:
- Around line 182-194: Add explicit type annotations to the inner helpers:
annotate run_cmd with a return type of subprocess.CompletedProcess[str] (i.e.,
def run_cmd() -> subprocess.CompletedProcess[str]:) and annotate status_callback
to return None (i.e., def status_callback(msg: str) -> None:). Ensure
subprocess.CompletedProcess is referenced via the subprocess module (or
imported) so the annotation resolves.
♻️ Duplicate comments (1)
cortex/utils/retry.py (1)

68-88: Preserve the original traceback when re-raising after retries.
Raising last_exception at the end loses the original traceback, which makes debugging harder after multiple retries. Consider storing exc_info and re-raising with its original traceback.

🛠️ Proposed fix
+import sys
@@
-        last_exception = None
+        last_exc_info = None
         last_result = None
@@
-            except Exception as e:
-                last_exception = e
+            except Exception as e:
+                last_exc_info = sys.exc_info()
                 if not self._should_retry(str(e)):
                     raise
@@
-        if last_exception:
-            raise last_exception
+        if last_exc_info:
+            _, exc, tb = last_exc_info
+            raise exc.with_traceback(tb)
         return last_result
🧹 Nitpick comments (1)
cortex/coordinator.py (1)

55-66: Document the new max_retries parameter in public docstrings.
Both __init__ and from_plan are public entry points; updating their docstrings keeps the API change discoverable.

Also applies to: 86-98

@KrishnaShuk
Contributor Author

PR passes all acceptance criteria.
Resolved all comments.
Added demonstration video.
Passing all tests.

@Anshgrover23 PR is ready to be merged!

@sonarqubecloud
Collaborator

@Anshgrover23 Anshgrover23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KrishnaShuk I've added some comments. Also, tests for the ErrorCategory integration are missing; please add them.

Also, two things are missing from requirements:

  • Different strategies for different error types
  • Documentation with configuration examples

from collections.abc import Callable
from typing import Any

from cortex.error_parser import ErrorCategory, ErrorParser
Collaborator


The ErrorParser class is unused; remove it.

timeout=300,
stop_on_error=True,
progress_callback=progress_callback,
max_retries=5,
Collaborator


Do not use magic numbers; add a constant instead, e.g. DEFAULT_MAX_RETRIES = 5.

Comment on lines +95 to +97
if not error_message:
# If no error message, assume it's a generic failure that might be transient
return True
Collaborator


This is risky. A command could fail with returncode != 0 and empty stderr/stdout for permanent reasons. Consider logging a warning here.
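
One way to keep the permissive default while surfacing the risk, as a sketch (the logger name and the substring check below are placeholders, not the PR's real classification logic):

```python
import logging

logger = logging.getLogger("cortex.retry")  # assumed logger name

def classify_empty_failure(error_message: str) -> bool:
    """Still retry on empty output, but log a warning so silent permanent
    failures remain visible in the logs."""
    if not error_message:
        logger.warning(
            "Command failed with no stderr/stdout; assuming transient and retrying."
        )
        return True
    # Placeholder for the real ErrorParser-based classification.
    return "timeout" in error_message.lower()
```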

Comment on lines +60 to +63
if hasattr(result, "stderr") and result.stderr:
error_msg = result.stderr
elif hasattr(result, "stdout") and result.stdout:
error_msg = result.stdout
Collaborator


Stdout may contain success output rather than error info, which could cause false retry classification. Please fix this.

attempt += 1
sleep_time = self.backoff_factor * (2 ** (attempt - 1))

msg = f"⚠️ Transient error detected. Retrying in {sleep_time}s... (Attempt {attempt}/{self.max_retries})"
Collaborator


On first retry, attempt=1, so message shows "Attempt 1/5" but it's actually the 2nd execution. Consider clarifying: "Retry 1/5" instead.
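
A possible wording fix, counting retries rather than attempts (a sketch; the helper name is invented for illustration):

```python
def format_retry_message(retry_number: int, max_retries: int, sleep_time: float) -> str:
    """Label the wait as 'Retry n/max' so the count matches re-executions,
    not total attempts, avoiding the off-by-one confusion."""
    return (
        f"⚠️ Transient error detected. Retrying in {sleep_time}s... "
        f"(Retry {retry_number}/{max_retries})"
    )
```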

Comment on lines +190 to +193
def status_callback(msg: str) -> None:
self._log(msg)
# Also print to stdout so the user sees the retry happening
print(msg)
Collaborator


If progress_callback is also set, user may see duplicate output. Consider checking if a callback exists before printing.

):
if not isinstance(max_retries, int) or max_retries < 0:
raise ValueError("max_retries must be a non-negative integer")
if not isinstance(backoff_factor, (int, float)) or backoff_factor < 0:
Collaborator


Should reject backoff_factor=0 as well since it defeats the purpose of backoff.
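
A sketch of the stricter validation being suggested, rejecting zero as well as negative values (illustrative only, not the PR's exact code):

```python
def validate_retry_config(max_retries: int, backoff_factor: float) -> None:
    """Reject configs that would break backoff: negative retries,
    or a non-positive backoff factor (zero defeats the purpose)."""
    if not isinstance(max_retries, int) or max_retries < 0:
        raise ValueError("max_retries must be a non-negative integer")
    if not isinstance(backoff_factor, (int, float)) or backoff_factor <= 0:
        raise ValueError("backoff_factor must be a positive number")
```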



Development

Successfully merging this pull request may close these issues.

Smart Retry Logic with Exponential Backoff

2 participants