feat: Add parallel LLM calls implementation (#276) #320
cortex/parallel_llm.py
@@ -0,0 +1,394 @@

#!/usr/bin/env python3
"""
Parallel LLM Executor for Cortex Linux

Enables concurrent LLM API calls with rate limiting for 2-3x speedup.
Batches independent queries and aggregates responses.

Use cases:
- Multi-package queries (analyze multiple packages simultaneously)
- Parallel error diagnosis
- Concurrent hardware config checks

Author: Cortex Linux Team
License: Apache 2.0
"""

import asyncio
import logging
import time
from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any

from cortex.llm_router import LLMProvider, LLMResponse, LLMRouter, TaskType

logger = logging.getLogger(__name__)


@dataclass
class ParallelQuery:
    """A single query to be executed in parallel."""

    id: str
    messages: list[dict[str, str]]
    task_type: TaskType = TaskType.USER_CHAT
    force_provider: LLMProvider | None = None
    temperature: float = 0.7
    max_tokens: int = 4096
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ParallelResult:
    """Result of a parallel query execution."""

    query_id: str
    response: LLMResponse | None
    error: str | None = None
    success: bool = True
    execution_time: float = 0.0


@dataclass
class BatchResult:
    """Aggregated results from a batch of parallel queries."""

    results: list[ParallelResult]
    total_time: float
    total_tokens: int
    total_cost: float
    success_count: int
    failure_count: int

    def get_result(self, query_id: str) -> ParallelResult | None:
        """Get result by query ID."""
        for r in self.results:
            if r.query_id == query_id:
                return r
        return None

    def successful_responses(self) -> list[LLMResponse]:
        """Get all successful LLM responses."""
        return [r.response for r in self.results if r.success and r.response]


class RateLimiter:
    """
    Token bucket rate limiter for API calls.

    Limits requests per second to avoid hitting provider rate limits.
    """

    def __init__(self, requests_per_second: float = 5.0):
        """
        Initialize rate limiter.

        Args:
            requests_per_second: Max requests allowed per second
        """
        self.rate = requests_per_second
        self.tokens = requests_per_second
        self.last_update = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Wait until a request token is available."""
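For orientation, here is a minimal usage sketch of the data structures above, assuming the module's executor class is named ParallelLLMExecutor and exposes the execute_batch_async() method referenced in the review comments below; the class name and constructor argument are assumptions, not confirmed by this excerpt of the diff.

# Hypothetical usage sketch: ParallelLLMExecutor and max_concurrent are assumptions;
# only ParallelQuery, BatchResult, and execute_batch_async appear in the PR discussion.
import asyncio

from cortex.llm_router import TaskType
from cortex.parallel_llm import ParallelLLMExecutor, ParallelQuery


async def main() -> None:
    executor = ParallelLLMExecutor(max_concurrent=5)  # assumed signature

    queries = [
        ParallelQuery(
            id="pkg-nginx",
            messages=[{"role": "user", "content": "Summarize the nginx package."}],
            task_type=TaskType.USER_CHAT,
        ),
        ParallelQuery(
            id="pkg-redis",
            messages=[{"role": "user", "content": "Summarize the redis package."}],
        ),
    ]

    batch = await executor.execute_batch_async(queries)
    print(f"{batch.success_count} succeeded, {batch.failure_count} failed")

    nginx_result = batch.get_result("pkg-nginx")
    if nginx_result and nginx_result.success and nginx_result.response:
        print(nginx_result.response)


asyncio.run(main())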
Comment on lines +93 to +96

Suggested change
-        self._lock = asyncio.Lock()
-    async def acquire(self) -> None:
-        """Wait until a request token is available."""
+        # Lazily initialize the asyncio.Lock in acquire() to bind it to the correct event loop.
+        self._lock: asyncio.Lock | None = None
+    async def acquire(self) -> None:
+        """Wait until a request token is available."""
+        if self._lock is None:
+            self._lock = asyncio.Lock()
Lock held during sleep blocks concurrency.
The acquire() method holds _lock while sleeping (line 105), which blocks all other coroutines from acquiring tokens during the wait. This serializes waiting callers instead of allowing them to calculate and wait concurrently.
🔎 Proposed fix: release lock before sleeping
 async def acquire(self) -> None:
     """Wait until a request token is available."""
+    wait_time = 0.0
     async with self._lock:
         now = time.monotonic()
         elapsed = now - self.last_update
         self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
         self.last_update = now
         if self.tokens < 1:
             wait_time = (1 - self.tokens) / self.rate
-            await asyncio.sleep(wait_time)
             self.tokens = 0
         else:
             self.tokens -= 1
+            return
+
+    await asyncio.sleep(wait_time)

🤖 Prompt for AI Agents
+ await asyncio.sleep(wait_time)🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 95 to 109, the acquire() method currently
holds self._lock while awaiting asyncio.sleep, which blocks other coroutines;
modify it to compute now, elapsed, and tentative tokens while holding the lock,
calculate wait_time if tokens < 1, then release the lock before awaiting sleep;
after the sleep re-acquire the lock, recompute now/elapsed/tokens (or update
tokens using the elapsed since last_update), then decrement tokens by 1 and
update last_update; ensure all state mutations (tokens and last_update) occur
under the lock but the actual asyncio.sleep happens outside the locked section.
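Putting the prompt's steps together, a minimal sketch of what RateLimiter.acquire() could look like with the sleep moved outside the lock; the retry loop and token recomputation after sleeping are assumptions about how the fix would be completed, not code from the PR.

    async def acquire(self) -> None:
        """Wait until a request token is available (sketch: sleep happens outside the lock)."""
        while True:
            async with self._lock:
                now = time.monotonic()
                elapsed = now - self.last_update
                self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
                self.last_update = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait_time = (1 - self.tokens) / self.rate
            # The lock is released here, so other coroutines can refill or consume
            # tokens while this caller sleeps; state is rechecked on the next pass.
            await asyncio.sleep(wait_time)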
Copilot AI (Dec 20, 2025)
The semaphore is created during initialization but this may cause issues if the executor is reused across different event loops. The semaphore is bound to the event loop where it was created, which could lead to runtime errors if execute_batch is called from different contexts. Consider creating the semaphore lazily within the async methods or documenting this limitation.
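A minimal sketch of the lazy-creation pattern this comment suggests; the class and attribute names are illustrative, not taken from the PR.

import asyncio


class ExecutorSketch:
    """Illustrative only: defers semaphore creation until a batch actually runs."""

    def __init__(self, max_concurrent: int = 5) -> None:
        self.max_concurrent = max_concurrent
        self._semaphore: asyncio.Semaphore | None = None  # created lazily

    async def _get_semaphore(self) -> asyncio.Semaphore:
        # Creating the semaphore inside a coroutine ties its first use to the
        # event loop that is actually executing the batch.
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self.max_concurrent)
        return self._semaphore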
Copilot AI (Dec 20, 2025)
Potential resource leak with asyncio event loop. The code calls asyncio.get_event_loop() which may return a closed loop or create a new loop depending on the context. In Python 3.10+, this is deprecated in favor of asyncio.get_running_loop() which only works within an async context. Since this code is already in an async function, use asyncio.get_running_loop() instead to ensure you're getting the correct running loop and avoid deprecation warnings.
Suggested change
- loop = asyncio.get_event_loop()
+ loop = asyncio.get_running_loop()
🛠️ Refactor suggestion | 🟠 Major
Use asyncio.get_running_loop() instead of deprecated get_event_loop().
asyncio.get_event_loop() is deprecated in Python 3.10+ and will emit a deprecation warning when called from a coroutine.
🔎 Proposed fix
- loop = asyncio.get_event_loop()
+ loop = asyncio.get_running_loop()

🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 152 to 154, the code calls
asyncio.get_event_loop() from within a coroutine which is deprecated; replace
that call with asyncio.get_running_loop() so the coroutine obtains the currently
running event loop (keeping the existing loop.run_in_executor(...) usage
unchanged) to avoid the deprecation warning and preserve behavior.
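A generic sketch of the pattern both comments point at (offloading a blocking call to the default thread pool from inside a coroutine); the helper name is illustrative, and the PR's actual call site is not shown in this excerpt.

import asyncio


async def run_blocking(blocking_fn, *args):
    # Inside a coroutine, get_running_loop() always returns the active loop and
    # never creates one, unlike the deprecated get_event_loop() behavior.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_fn, *args)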
Copilot AI (Dec 20, 2025)
Mixing time.time() and time.monotonic() for timing calculations. The code uses time.monotonic() for rate limiting (lines 92 and 98) but time.time() for execution timing (lines 146, 169, and 184). For measuring elapsed time, time.monotonic() is preferred because it is not affected by system clock adjustments. Consider using time.monotonic() consistently for all execution time measurements.
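For illustration, a small helper that times a call with time.monotonic(); this is a sketch of the suggestion, not code from the PR.

import time


def timed_call(fn, *args, **kwargs):
    # time.monotonic() is unaffected by system clock adjustments (NTP, manual
    # changes), so it is the safer choice for measuring elapsed execution time.
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start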
Copilot AI (Dec 20, 2025)
The exponential backoff implementation is incorrect. The current formula 0.5 * (attempt + 1) results in linear backoff (0.5s, 1.0s, 1.5s), not exponential backoff. For true exponential backoff, use a formula like 0.5 * (2 ** attempt) which would give delays of 0.5s, 1.0s, 2.0s.
Suggested change
- await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
+ await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
Backoff is linear, not exponential.
The delay 0.5 * (attempt + 1) produces linear backoff (0.5s, 1.0s, 1.5s). True exponential backoff would be 0.5 * (2 ** attempt) (0.5s, 1.0s, 2.0s).
Consider updating either the implementation or the docstring/comments to match.
🔎 Proposed fix for exponential backoff
 if self.retry_failed and attempt < self.max_retries:
-    await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
+    await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
     return await self._execute_single(query, attempt + 1)

📝 Committable suggestion
Suggested change
-if self.retry_failed and attempt < self.max_retries:
-    await asyncio.sleep(0.5 * (attempt + 1))  # exponential backoff
-    return await self._execute_single(query, attempt + 1)
+if self.retry_failed and attempt < self.max_retries:
+    await asyncio.sleep(0.5 * (2 ** attempt))  # exponential backoff
+    return await self._execute_single(query, attempt + 1)
🤖 Prompt for AI Agents
In cortex/parallel_llm.py around lines 175 to 177, the retry delay uses 0.5 *
(attempt + 1) which yields linear backoff (0.5s, 1.0s, 1.5s); change it to
exponential backoff by using 0.5 * (2 ** attempt) so delays become 0.5s, 1.0s,
2.0s, etc., or alternatively update the surrounding comment/docstring to state
that the current behavior is linear backoff if you want to keep the existing
formula.
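For reference, a hedged sketch contrasting the two delay schedules; the optional jitter term is an addition for illustration, not part of the PR.

import random


def backoff_delay(attempt: int, base: float = 0.5, exponential: bool = True) -> float:
    # With a 0-based attempt counter: exponential gives 0.5s, 1.0s, 2.0s, ...
    # while the linear form currently in the PR gives 0.5s, 1.0s, 1.5s, ...
    delay = base * (2 ** attempt) if exponential else base * (attempt + 1)
    # A little random jitter helps avoid synchronized retry bursts against a
    # rate-limited API; drop it if deterministic delays are preferred.
    return delay + random.uniform(0.0, base / 2)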
Copilot AI (Dec 20, 2025)
The method name execute_with_callback_async is inconsistent with the naming pattern of other methods. The other async method is named execute_batch_async (noun + async), but this is execute_with_callback_async (verb + prepositional phrase + async). For consistency, consider renaming to execute_batch_with_callback_async to maintain the parallel structure with execute_batch_async.
The comment states "exponential backoff" but the implementation uses linear backoff. The comment should be corrected to "linear backoff" to match the actual behavior, or the implementation should be fixed to use true exponential backoff (see related bug comment).