feat: Add parallel LLM calls implementation (#276) #320
Conversation
Walkthrough

A new parallel LLM execution framework is introduced via `cortex/parallel_llm.py`, along with accompanying unit tests and documentation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant Executor as ParallelLLMExecutor
    participant RateLimit as RateLimiter
    participant Router as LLMRouter
    participant ThreadPool as Thread Pool
    participant Results as Result Aggregator

    Caller->>Executor: execute_batch_async(queries)
    Executor->>Executor: create semaphore & tasks
    loop For each query
        Executor->>RateLimit: acquire()
        RateLimit-->>Executor: token available
        Executor->>ThreadPool: run_in_executor(router.complete)
        ThreadPool->>Router: complete(query)
        Router-->>ThreadPool: LLMResponse
        ThreadPool-->>Executor: response
        alt Success
            Executor->>Results: add ParallelResult
        else Failure & retry_enabled
            Executor->>Executor: exponential backoff
            Executor->>RateLimit: acquire()
            Executor->>ThreadPool: retry execute
        else Failure & no_retry
            Executor->>Results: add failed ParallelResult
        end
    end
    Executor->>Results: finalize BatchResult
    Results-->>Caller: BatchResult(aggregated stats)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Pull request overview
This PR adds parallel/concurrent LLM API call support to Cortex Linux, enabling significant performance improvements for batch operations. The implementation introduces a new parallel_llm.py module with async-based concurrent execution, rate limiting, and automatic retry capabilities.
Key Changes:
- New `ParallelLLMExecutor` class for concurrent API calls with semaphore-based concurrency control
- Token bucket rate limiter to prevent API throttling
- Helper functions for common batch operations (package queries, error diagnosis, hardware checks); a minimal usage sketch follows this list
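A minimal usage sketch based on the API surface referenced in this review (`ParallelQuery(id=..., messages=...)`, `execute_batch`, `BatchResult.success_count`, `get_result`); exact signatures may differ in the merged module.

```python
from cortex.parallel_llm import ParallelLLMExecutor, ParallelQuery

# Build a small batch of independent queries.
queries = [
    ParallelQuery(id="q1", messages=[{"role": "user", "content": "Summarize package nginx"}]),
    ParallelQuery(id="q2", messages=[{"role": "user", "content": "Summarize package redis"}]),
]

# Concurrency and rate limits are configured on the executor.
executor = ParallelLLMExecutor(max_concurrent=5, requests_per_second=10.0)

# execute_batch blocks until every query has completed (or exhausted retries).
result = executor.execute_batch(queries)
print(f"{result.success_count}/{len(queries)} succeeded")
print(result.get_result("q1"))
```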
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| cortex/parallel_llm.py | New module implementing parallel LLM execution with rate limiting, retries, and result aggregation |
| tests/test_parallel_llm.py | Comprehensive unit tests covering parallel execution, rate limiting, callbacks, and helper functions |
| docs/PARALLEL_LLM_IMPLEMENTATION.md | Documentation with usage examples, configuration options, and performance benchmarks |
```python
from cortex.parallel_llm import ParallelLLMExecutor, ParallelQuery


def on_complete(result):
    status = "✓" if result.success else "✗"
    print(f"{status} {result.query_id} completed in {result.execution_time:.2f}s")


executor = ParallelLLMExecutor()
# Use execute_with_callback_async for progress tracking
```
Copilot AI (Dec 20, 2025):
The documentation example is incomplete and shows `# Use execute_with_callback_async for progress tracking` without providing the actual implementation. This will confuse users trying to understand how to use the callback feature. Complete the example by showing how to use `asyncio.run()` with `executor.execute_with_callback_async(queries, on_complete)`.
Suggested change:

```python
import asyncio
from cortex.parallel_llm import ParallelLLMExecutor, ParallelQuery


def on_complete(result):
    status = "✓" if result.success else "✗"
    print(f"{status} {result.query_id} completed in {result.execution_time:.2f}s")


async def run_with_progress():
    executor = ParallelLLMExecutor()
    queries = [
        ParallelQuery(query_id="gpu_check", prompt="Analyze GPU configuration and health."),
        ParallelQuery(query_id="cpu_check", prompt="Analyze CPU configuration and health."),
        ParallelQuery(query_id="ram_check", prompt="Analyze RAM configuration and health."),
    ]
    # Use execute_with_callback_async for progress tracking
    await executor.execute_with_callback_async(queries, on_complete)


asyncio.run(run_with_progress())
```
```python
from cortex.llm_router import TaskType


executor = ParallelLLMExecutor(max_concurrent=5, requests_per_second=10.0)
```
Copilot AI (Dec 20, 2025):
The example initializes `ParallelLLMExecutor` without providing a router, but the `router` parameter is shown as optional in the constructor documentation. When `router` is `None`, a new `LLMRouter()` is created. This should be explicitly mentioned in the usage example or shown as `executor = ParallelLLMExecutor(router=router, max_concurrent=5, requests_per_second=10.0)` to demonstrate proper initialization with an existing router instance.
Suggested change:

```python
from cortex.llm_router import LLMRouter, TaskType

router = LLMRouter()
executor = ParallelLLMExecutor(router=router, max_concurrent=5, requests_per_second=10.0)
```
```python
def test_multiple_rapid_acquires(self):
    """Test multiple rapid acquires work correctly."""
    limiter = RateLimiter(requests_per_second=10.0)

    async def run_test():
        for _ in range(5):
            await limiter.acquire()
        # Should have consumed 5 tokens
        self.assertLessEqual(limiter.tokens, 5.5)

    asyncio.run(run_test())
```
Copilot AI (Dec 20, 2025):
Missing test coverage for the rate limiter's token replenishment behavior. While tests verify token consumption, there's no test that verifies tokens are correctly replenished over time according to the configured rate. Add a test that waits for tokens to replenish and verifies the rate limiter allows requests after waiting.
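A hedged sketch of such a replenishment test, assuming the `RateLimiter` exposes the `tokens` attribute used in the surrounding tests and refills at `requests_per_second`:

```python
def test_token_replenishment(self):
    """Sketch: tokens should refill over time at the configured rate."""
    limiter = RateLimiter(requests_per_second=10.0)

    async def run_test():
        # Drain a handful of tokens first.
        for _ in range(5):
            await limiter.acquire()
        drained = limiter.tokens

        # Wait long enough for ~5 tokens to return (10/s * 0.5s), then acquire
        # once more so the limiter updates its bookkeeping.
        await asyncio.sleep(0.5)
        await limiter.acquire()
        self.assertGreater(limiter.tokens, drained)

    asyncio.run(run_test())
```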
```python
def test_concurrent_execution_time(self):
    """Test that parallel execution is faster than sequential."""
    import time

    delay_time = 0.1

    def slow_complete(*args, **kwargs):
        time.sleep(delay_time)
        return self.mock_response

    self.mock_router.complete.side_effect = slow_complete
    executor = ParallelLLMExecutor(
        router=self.mock_router,
        max_concurrent=5,
        requests_per_second=100.0,  # High rate to not limit
    )
    queries = [
        ParallelQuery(id=f"speed_{i}", messages=[{"role": "user", "content": f"Test {i}"}])
        for i in range(3)
    ]

    start = time.time()
    result = executor.execute_batch(queries)
    elapsed = time.time() - start

    # Parallel should complete faster than 3 * delay_time
    # Allow some overhead but should be significantly faster
    self.assertLess(elapsed, 3 * delay_time * 0.9)
    self.assertEqual(result.success_count, 3)
```
Copilot AI (Dec 20, 2025):
Missing test coverage for concurrent execution with actual rate limiting. While test_concurrent_execution_time tests parallelism with a high rate limit (100.0), there's no test that verifies rate limiting actually throttles requests when the limit is low. Add a test with a low rate limit that verifies requests are properly throttled over time.
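A hedged sketch of such a throttling test, reusing the mock router from this file's fixtures; the timing threshold assumes the token bucket starts with roughly `requests_per_second` tokens:

```python
def test_rate_limiting_throttles_requests(self):
    """Sketch: a low rate limit should spread a batch out over time."""
    import time

    self.mock_router.complete.return_value = self.mock_response
    executor = ParallelLLMExecutor(
        router=self.mock_router,
        max_concurrent=5,
        requests_per_second=2.0,  # low rate so throttling dominates
    )
    queries = [
        ParallelQuery(id=f"throttle_{i}", messages=[{"role": "user", "content": f"Test {i}"}])
        for i in range(5)
    ]

    start = time.monotonic()
    result = executor.execute_batch(queries)
    elapsed = time.monotonic() - start

    self.assertEqual(result.success_count, 5)
    # With ~2 initial tokens, the remaining 3 requests need roughly 1.5s of refills.
    self.assertGreaterEqual(elapsed, 1.0)
```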
```python
            async with self._semaphore:
                # Run sync router.complete in thread pool
                loop = asyncio.get_event_loop()
```
Copilot AI (Dec 20, 2025):
Potential resource leak with asyncio event loop. The code calls asyncio.get_event_loop() which may return a closed loop or create a new loop depending on the context. In Python 3.10+, this is deprecated in favor of asyncio.get_running_loop() which only works within an async context. Since this code is already in an async function, use asyncio.get_running_loop() instead to ensure you're getting the correct running loop and avoid deprecation warnings.
Suggested change:

```diff
- loop = asyncio.get_event_loop()
+ loop = asyncio.get_running_loop()
```
```python
    id: str
    messages: list[dict[str, str]]
    task_type: TaskType = TaskType.USER_CHAT
    force_provider: LLMProvider | None = None
```
Copilot AI (Dec 20, 2025):
The comment states "exponential backoff" but the implementation uses linear backoff. The comment should be corrected to "linear backoff" to match the actual behavior, or the implementation should be fixed to use true exponential backoff (see related bug comment).
```python
        start_time = time.time()

        try:
            await self.rate_limiter.acquire()

            async with self._semaphore:
                # Run sync router.complete in thread pool
                loop = asyncio.get_event_loop()
                response = await loop.run_in_executor(
                    None,
                    lambda: self.router.complete(
                        messages=query.messages,
                        task_type=query.task_type,
                        force_provider=query.force_provider,
                        temperature=query.temperature,
                        max_tokens=query.max_tokens,
                    ),
                )

            return ParallelResult(
                query_id=query.id,
                response=response,
                success=True,
                execution_time=time.time() - start_time,
```
Copilot AI (Dec 20, 2025):
Mixing time.time() and time.monotonic() for timing calculations. The code uses time.monotonic() for rate limiting (line 92, 98) but time.time() for execution timing (lines 146, 169, 184). For measuring elapsed time, time.monotonic() is preferred as it's not affected by system clock adjustments. Consider using time.monotonic() consistently for all execution time measurements.
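A small standalone sketch of the monotonic-clock suggestion (illustrative only, not the module's code):

```python
import time

# time.monotonic() is immune to system clock adjustments, so it is the safer
# choice for measuring elapsed execution time.
start = time.monotonic()
time.sleep(0.05)  # stand-in for the awaited router call
execution_time = time.monotonic() - start
print(f"execution_time: {execution_time:.3f}s")
```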
```python
    async def execute_with_callback_async(
        self,
        queries: list[ParallelQuery],
        on_complete: Callable[[ParallelResult], None] | None = None,
    ) -> BatchResult:
```
Copilot AI (Dec 20, 2025):
The method name execute_with_callback_async is inconsistent with the naming pattern of other methods. The other async method is named execute_batch_async (noun + async), but this is execute_with_callback_async (verb + prepositional phrase + async). For consistency, consider renaming to execute_batch_with_callback_async to maintain the parallel structure with execute_batch_async.
```python
            logger.warning(f"Query {query.id} failed (attempt {attempt + 1}): {e}")

            if self.retry_failed and attempt < self.max_retries:
                await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
```
Copilot AI (Dec 20, 2025):
The exponential backoff implementation is incorrect. The current formula 0.5 * (attempt + 1) results in linear backoff (0.5s, 1.0s, 1.5s), not exponential backoff. For true exponential backoff, use a formula like 0.5 * (2 ** attempt) which would give delays of 0.5s, 1.0s, 2.0s.
Suggested change:

```diff
- await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
+ await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
```
```python
import os
import sys
import unittest
from unittest.mock import MagicMock, Mock, patch
```
Copilot AI (Dec 20, 2025):
Import of 'MagicMock' is not used.
Import of 'patch' is not used.
Suggested change:

```diff
- from unittest.mock import MagicMock, Mock, patch
+ from unittest.mock import Mock
```
Actionable comments posted: 3
🧹 Nitpick comments (4)
cortex/parallel_llm.py (2)
233-244: Consider documenting that this cannot be called from within an existing event loop.
`asyncio.run()` will raise `RuntimeError` if called from within an async context. Consider adding a note in the docstring, or use `asyncio.get_event_loop().run_until_complete()` for broader compatibility. A guard sketch follows.
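A minimal sketch of one way to guard the synchronous wrapper against being called from inside a running loop; the helper name is illustrative, only `execute_batch_async` comes from this PR:

```python
import asyncio


def run_batch_blocking(executor, queries):
    """Illustrative helper: fail clearly instead of nesting event loops."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running, so asyncio.run() is safe here.
        return asyncio.run(executor.execute_batch_async(queries))
    raise RuntimeError(
        "Already inside an event loop; await executor.execute_batch_async(queries) instead."
    )
```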
270-271: Remove unused variable assignment.
`results = []` on line 271 is immediately overwritten by `results = await asyncio.gather(*tasks)` on line 280 and is never used.

🔎 Proposed fix
```diff
         start_time = time.time()
-        results = []

         async def execute_with_notify(query: ParallelQuery) -> ParallelResult:
```

tests/test_parallel_llm.py (1)
14-14: Consider using proper package structure instead of sys.path manipulation.
`sys.path.insert(0, ...)` is fragile. Consider using `pytest` with proper package configuration or `conftest.py` for path setup.

docs/PARALLEL_LLM_IMPLEMENTATION.md (1)
99-111: Incomplete callback example. The "With Progress Callback" example defines `on_complete` but doesn't show how to use it with `execute_with_callback_async`. Consider completing the example for clarity.

🔎 Suggested completion

```python
import asyncio

from cortex.parallel_llm import ParallelLLMExecutor, ParallelQuery


def on_complete(result):
    status = "✓" if result.success else "✗"
    print(f"{status} {result.query_id} completed in {result.execution_time:.2f}s")


async def run_with_callback():
    executor = ParallelLLMExecutor()
    queries = [
        ParallelQuery(id="q1", messages=[{"role": "user", "content": "Query 1"}]),
        ParallelQuery(id="q2", messages=[{"role": "user", "content": "Query 2"}]),
    ]
    return await executor.execute_with_callback_async(queries, on_complete)


asyncio.run(run_with_callback())
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- cortex/parallel_llm.py (1 hunks)
- docs/PARALLEL_LLM_IMPLEMENTATION.md (1 hunks)
- tests/test_parallel_llm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Follow PEP 8 style guide
Type hints required in Python code
Docstrings required for all public APIs
Files:
- tests/test_parallel_llm.py
- cortex/parallel_llm.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Maintain >80% test coverage for pull requests
Files:
tests/test_parallel_llm.py
🧬 Code graph analysis (2)
tests/test_parallel_llm.py (2)
cortex/llm_router.py (3)
- LLMProvider (43-47), LLMResponse (51-60), TaskType (30-40)

cortex/parallel_llm.py (13)
- BatchResult (54-73), ParallelQuery (30-39), ParallelResult (43-50), RateLimiter (76-108), create_error_diagnosis_queries (329-360), create_hardware_check_queries (363-394), create_package_queries (297-326), get_result (64-69), successful_responses (71-73), acquire (95-108), execute_batch (233-243), execute_with_callback_async (245-294), execute_batch_async (187-231)
cortex/parallel_llm.py (2)
cortex/llm_router.py (4)
- LLMProvider (43-47), LLMResponse (51-60), LLMRouter (73-418), TaskType (30-40)

tests/test_parallel_llm.py (1)
- on_complete (319-320)
🪛 LanguageTool
docs/PARALLEL_LLM_IMPLEMENTATION.md
[grammar] ~131-~131: Use a hyphen to join words.
Context: ...s - Parallel: ~1.2s (accounting for rate limiting overhead) **Speedup: ~4x for 5...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/PARALLEL_LLM_IMPLEMENTATION.md
133-133: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build Package
- GitHub Check: Agent
🔇 Additional comments (17)
cortex/parallel_llm.py (7)
1-27: LGTM! Module structure follows best practices with appropriate imports and a comprehensive docstring.

29-74: LGTM! Data classes are well-structured with proper type hints and docstrings. The helper methods `get_result` and `successful_responses` are clear and functional.

111-143: LGTM! Executor initialization is well-structured with sensible defaults and proper documentation.

187-232: LGTM! The async batch execution is well-implemented with proper empty-input handling and statistics aggregation.

297-327: LGTM! Helper function is clean and follows the expected pattern with proper documentation.

329-361: LGTM! Error diagnosis query helper is well-structured with appropriate task type assignment.

363-394: LGTM! Hardware check query helper follows the established pattern consistently.
tests/test_parallel_llm.py (7)
29-54: LGTM! Tests for the `ParallelQuery` dataclass cover creation and metadata scenarios appropriately.

56-91: LGTM! Tests for `ParallelResult` cover both success and failure scenarios with appropriate assertions.

93-174: LGTM! `BatchResult` tests comprehensively cover statistics, lookup, and filtering functionality.

176-206: LGTM! Rate limiter tests cover initialization and token consumption. Consider adding a test for the waiting behavior when tokens are exhausted to improve coverage.

208-335: LGTM! Executor tests are comprehensive, covering initialization, execution paths, failure handling, retry logic, and callbacks. Good use of mocking for the router.

337-386: LGTM! Helper function tests verify correct ID generation, task type assignment, and custom template handling.

420-449: Timing-based test may be flaky in CI environments. The test relies on wall-clock timing, which can vary significantly under load. Consider increasing the tolerance or using a different approach (e.g., verifying that tasks ran concurrently by checking overlapping execution windows; a sketch follows).
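A hedged sketch of the overlapping-windows approach, using the same mock router fixtures as the existing tests:

```python
def test_tasks_overlap_in_time(self):
    """Sketch: prove concurrency by recording per-call execution windows."""
    import threading
    import time

    windows = []
    lock = threading.Lock()

    def slow_complete(*args, **kwargs):
        started = time.monotonic()
        time.sleep(0.1)
        with lock:
            windows.append((started, time.monotonic()))
        return self.mock_response

    self.mock_router.complete.side_effect = slow_complete
    executor = ParallelLLMExecutor(
        router=self.mock_router, max_concurrent=5, requests_per_second=100.0
    )
    queries = [
        ParallelQuery(id=f"overlap_{i}", messages=[{"role": "user", "content": f"Test {i}"}])
        for i in range(3)
    ]
    executor.execute_batch(queries)

    # If calls ran concurrently, at least one window starts before another ends.
    windows.sort()
    self.assertTrue(
        any(windows[i + 1][0] < windows[i][1] for i in range(len(windows) - 1))
    )
```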
docs/PARALLEL_LLM_IMPLEMENTATION.md (3)
36-36: Inconsistency with implementation: backoff is linear, not exponential. The documentation states "exponential backoff" but the implementation uses linear backoff (`0.5 * (attempt + 1)`). Update either the implementation or documentation to match.

127-134: LGTM! Performance section provides useful context. The static analysis hints about hyphenation and emphasis are stylistic and can be safely ignored.

135-150: LGTM! Documentation is comprehensive and provides clear guidance on testing and future directions.
```python
    async def acquire(self) -> None:
        """Wait until a request token is available."""
        async with self._lock:
            now = time.monotonic()
            elapsed = now - self.last_update
            self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
            self.last_update = now

            if self.tokens < 1:
                wait_time = (1 - self.tokens) / self.rate
                await asyncio.sleep(wait_time)
                self.tokens = 0
            else:
                self.tokens -= 1
```
Lock held during sleep blocks concurrency.
The acquire() method holds _lock while sleeping (line 105), which blocks all other coroutines from acquiring tokens during the wait. This serializes waiting callers instead of allowing them to calculate and wait concurrently.
🔎 Proposed fix: release lock before sleeping
```diff
 async def acquire(self) -> None:
     """Wait until a request token is available."""
+    wait_time = 0.0
     async with self._lock:
         now = time.monotonic()
         elapsed = now - self.last_update
         self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
         self.last_update = now
         if self.tokens < 1:
             wait_time = (1 - self.tokens) / self.rate
-            await asyncio.sleep(wait_time)
             self.tokens = 0
         else:
             self.tokens -= 1
+            return
+
+    if wait_time > 0:
+        await asyncio.sleep(wait_time)
```

🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 95 to 109, the acquire() method currently
holds self._lock while awaiting asyncio.sleep, which blocks other coroutines;
modify it to compute now, elapsed, and tentative tokens while holding the lock,
calculate wait_time if tokens < 1, then release the lock before awaiting sleep;
after the sleep re-acquire the lock, recompute now/elapsed/tokens (or update
tokens using the elapsed since last_update), then decrement tokens by 1 and
update last_update; ensure all state mutations (tokens and last_update) occur
under the lock but the actual asyncio.sleep happens outside the locked section.
```python
                # Run sync router.complete in thread pool
                loop = asyncio.get_event_loop()
                response = await loop.run_in_executor(
```
🛠️ Refactor suggestion | 🟠 Major
Use asyncio.get_running_loop() instead of deprecated get_event_loop().
asyncio.get_event_loop() is deprecated in Python 3.10+ and will emit a deprecation warning when called from a coroutine.
🔎 Proposed fix
```diff
- loop = asyncio.get_event_loop()
+ loop = asyncio.get_running_loop()
```

🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 152 to 154, the code calls
asyncio.get_event_loop() from within a coroutine which is deprecated; replace
that call with asyncio.get_running_loop() so the coroutine obtains the currently
running event loop (keeping the existing loop.run_in_executor(...) usage
unchanged) to avoid the deprecation warning and preserve behavior.
```python
            if self.retry_failed and attempt < self.max_retries:
                await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
                return await self._execute_single(query, attempt + 1)
```
Backoff is linear, not exponential.
The delay 0.5 * (attempt + 1) produces linear backoff (0.5s, 1.0s, 1.5s). True exponential backoff would be 0.5 * (2 ** attempt) (0.5s, 1.0s, 2.0s).
Consider updating either the implementation or the docstring/comments to match.
🔎 Proposed fix for exponential backoff
```diff
 if self.retry_failed and attempt < self.max_retries:
-    await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
+    await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
     return await self._execute_single(query, attempt + 1)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self.retry_failed and attempt < self.max_retries:
    await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
    return await self._execute_single(query, attempt + 1)
```
🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 175 to 177, the retry delay uses 0.5 *
(attempt + 1) which yields linear backoff (0.5s, 1.0s, 1.5s); change it to
exponential backoff by using 0.5 * (2 ** attempt) so delays become 0.5s, 1.0s,
2.0s, etc., or alternatively update the surrounding comment/docstring to state
that the current behavior is linear backoff if you want to keep the existing
formula.



Related Issue
Closes #276
Summary
This implementation adds parallel/concurrent LLM API call support to Cortex Linux, enabling 2-3x speedup for batch operations. The previous architecture made sequential LLM calls, which was slow for batch workloads. The new module uses `asyncio` with semaphore-based concurrency control, as sketched below.
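A minimal sketch of the asyncio-plus-semaphore pattern the summary refers to; the function and names here are illustrative, not the module's actual API:

```python
import asyncio


async def run_with_limit(coros, max_concurrent: int = 5):
    """Run awaitables concurrently, but never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        async with semaphore:  # acquired per task, released on completion
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_llm_call(i: int) -> str:
    await asyncio.sleep(0.1)  # stand-in for an API round trip
    return f"response {i}"


if __name__ == "__main__":
    print(asyncio.run(run_with_limit([fake_llm_call(i) for i in range(10)])))
```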
Checklist
- Tests pass (`pytest tests/`)
- Related issue: #276

Description

Summary by CodeRabbit
Release Notes
New Features
Documentation