-
-
Notifications
You must be signed in to change notification settings - Fork 52
feat: Implement Model Lifecycle Manager - Systemd-based LLM service management #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement Model Lifecycle Manager - Systemd-based LLM service management #234
Conversation
…anagement Complete implementation of issue cortexlinux#220 ($150 bounty): Features: - Systemd service generation for any LLM backend - Multi-backend support (vLLM, llama.cpp, Ollama, TGI) - Health check monitoring with auto-restart on failure - Resource limits via systemd (CPU, memory, I/O, tasks) - Security hardening (NoNewPrivileges, ProtectSystem, etc.) - SQLite database for configuration persistence - Event logging for audit trail - Boot auto-start via systemd enable Files: - model_lifecycle.py: 993 lines of implementation - test_model_lifecycle.py: 907 lines, 63 tests (all passing) - README_MODEL_LIFECYCLE.md: Complete documentation Closes cortexlinux#220
WalkthroughIntroduces a comprehensive Model Lifecycle Manager subsystem enabling systemd-based LLM service management for Cortex. Adds persistent SQLite storage, multi-backend support (vLLM, llama.cpp, TGI, Ollama), health monitoring with auto-restart, resource configuration, and rich CLI commands for model registration, start/stop, enable/disable workflows. Changes
Sequence DiagramsequenceDiagram
participant User as User/CLI
participant Manager as ModelLifecycleManager
participant DB as ModelDatabase
participant Gen as ServiceGenerator
participant Systemd as systemd
participant HC as HealthChecker
participant Service as LLM Service
rect rgb(200, 220, 255)
note over User,Systemd: Model Registration & Start Flow
User->>Manager: register(config)
Manager->>DB: save_model(config)
DB-->>Manager: ✓ saved
Manager->>Gen: generate(config)
Gen-->>Manager: systemd unit file
Manager->>Systemd: write unit + daemon-reload
Systemd-->>Manager: ✓ loaded
Manager->>DB: log_event(REGISTERED)
DB-->>Manager: ✓ logged
Manager-->>User: registration complete
User->>Manager: start(name)
Manager->>Systemd: systemctl start service
Systemd->>Service: launch LLM backend
Service-->>Systemd: running
Systemd-->>Manager: ✓ started
Manager->>DB: log_event(STARTED)
Manager-->>User: ✓ started
end
rect rgb(220, 240, 220)
note over HC,Service: Health Monitoring & Auto-Restart
HC->>HC: monitor interval elapsed
HC->>Service: GET /health
Service-->>HC: 200 OK
HC->>DB: log_event(HEALTH_CHECK_PASSED)
DB-->>HC: ✓ logged
HC->>Service: GET /health (later)
Service-->>HC: timeout/error
HC->>HC: increment failure count
alt Failure threshold reached
HC->>Systemd: systemctl restart service
Systemd->>Service: restart LLM backend
Service-->>Systemd: running (restarted)
HC->>DB: log_event(AUTO_RESTARTED)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Possibly related PRs
Suggested labels
Poem
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (8)
README_MODEL_LIFECYCLE.md (1)
188-206: Add language specifier to fenced code block.The architecture diagram code block lacks a language specifier. While this is plaintext, adding a specifier improves rendering consistency.
-``` +```text ModelLifecycleManagertests/test_model_lifecycle.py (1)
355-361: Rename unused loop variable to_.The loop variable
iis not used within the loop body. Per Python convention, rename it to_to indicate it's intentionally unused.- for i in range(10): + for _ in range(10): self.db.log_event("test", EventType.STARTED)cortex/kernel_features/model_lifecycle.py (6)
239-246: Use explicitOptionaltype hint.PEP 484 prohibits implicit
Optional. Thedetailsparameter should be explicitly typed.- def log_event(self, model_name: str, event_type: EventType, details: str = None) -> None: + def log_event(self, model_name: str, event_type: EventType, details: Optional[str] = None) -> None:
248-267: Use explicitOptionaltype hint formodel_nameparameter.- def get_events(self, model_name: str = None, limit: int = 100) -> List[Dict[str, Any]]: + def get_events(self, model_name: Optional[str] = None, limit: int = 100) -> List[Dict[str, Any]]:
270-317: Annotate class-level mutable attributes withClassVar.The
BACKENDSandHEALTH_ENDPOINTSdictionaries are class-level constants and should be annotated withClassVarto clarify they are not instance attributes.+from typing import ClassVar + class ServiceGenerator: """Generate systemd service files for LLM backends.""" # Backend command templates - BACKENDS = { + BACKENDS: ClassVar[Dict[str, str]] = { ... } # Health check endpoints by backend - HEALTH_ENDPOINTS = { + HEALTH_ENDPOINTS: ClassVar[Dict[str, str]] = { ... }
331-336: Deduplicate identical branches for dtype handling.The
vllmandtgibranches have identical logic for dtype argument construction.dtype_arg = "" if config.dtype != "auto": - if config.backend == "vllm": - dtype_arg = f"--dtype {config.dtype}" - elif config.backend == "tgi": + if config.backend in ("vllm", "tgi"): dtype_arg = f"--dtype {config.dtype}"
502-544: High cognitive complexity inmonitor_loop- consider refactoring.The nested
monitor_loopclosure has high cognitive complexity (17 vs allowed 15 per SonarCloud). Consider extracting the health check result handling into a separate method.def _handle_health_result(self, name: str, healthy: bool, msg: str, hc: HealthCheckConfig) -> float: """Handle health check result. Returns sleep duration.""" if healthy: if self._failure_counts.get(name, 0) > 0: self.manager.db.log_event(name, EventType.HEALTH_CHECK_PASSED, msg) self._failure_counts[name] = 0 return hc.interval_seconds self._failure_counts[name] = self._failure_counts.get(name, 0) + 1 self.manager.db.log_event( name, EventType.HEALTH_CHECK_FAILED, f"Failure {self._failure_counts[name]}/{hc.max_failures}: {msg}" ) if self._failure_counts[name] >= hc.max_failures: self.manager.db.log_event(name, EventType.AUTO_RESTARTED) self.manager.restart(name, log_event=False) self._failure_counts[name] = 0 return hc.startup_delay_seconds return hc.interval_seconds
600-605: Narrow the exception handler to specific exceptions.Catching bare
Exceptioncan mask programming errors. Consider catching specific exceptions likeOSError,sqlite3.Error, etc.- except Exception as e: + except (OSError, sqlite3.Error, subprocess.SubprocessError) as e: self.db.log_event(config.name, EventType.ERROR, str(e)) print(f"Failed to register '{config.name}': {e}") return FalseApply similar changes to
unregister()at lines 635-638.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
README_MODEL_LIFECYCLE.md(1 hunks)cortex/kernel_features/model_lifecycle.py(1 hunks)tests/test_model_lifecycle.py(1 hunks)
🧰 Additional context used
🪛 GitHub Check: SonarCloud Code Analysis
tests/test_model_lifecycle.py
[warning] 357-357: Replace the unused loop index "i" with "_".
cortex/kernel_features/model_lifecycle.py
[failure] 68-68: Define a constant instead of duplicating this literal "/health" 5 times.
[warning] 336-336: Either merge this branch with the identical one on line "334" or change one of the implementations.
[failure] 860-860: Define a constant instead of duplicating this literal "Model name" 9 times.
[failure] 502-502: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.
[failure] 839-839: Refactor this function to reduce its Cognitive Complexity from 21 to the 15 allowed.
🪛 markdownlint-cli2 (0.18.1)
README_MODEL_LIFECYCLE.md
188-188: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.7)
tests/test_model_lifecycle.py
1-1: Shebang is present but file is not executable
(EXE001)
182-182: Possible binding to all interfaces
(S104)
357-357: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
541-541: Unused method argument: mock_run
(ARG002)
756-756: Possible binding to all interfaces
(S104)
cortex/kernel_features/model_lifecycle.py
239-239: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
248-248: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
274-309: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
312-317: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
492-492: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
493-493: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
499-499: Do not catch blind exception: Exception
(BLE001)
560-560: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
569-569: subprocess call: check for execution of untrusted input
(S603)
570-570: Consider ["systemctl", "--user", *list(args)] instead of concatenation
Replace with ["systemctl", "--user", *list(args)]
(RUF005)
601-601: Consider moving this statement to an else block
(TRY300)
602-602: Do not catch blind exception: Exception
(BLE001)
634-634: Consider moving this statement to an else block
(TRY300)
635-635: Do not catch blind exception: Exception
(BLE001)
782-782: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
820-820: subprocess call: check for execution of untrusted input
(S603)
822-822: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (8)
tests/test_model_lifecycle.py (3)
540-544: Unusedmock_runparameter is intentional.The
mock_runparameter appears unused but is necessary to prevent actualsubprocess.runcalls during the test. The@patchdecorator requires a corresponding parameter. This is correct.
656-712: Good coverage of health check scenarios.The HealthChecker tests cover success, bad status codes, and connection errors with proper mocking of
urllib.request.urlopen. The context manager mock pattern is correctly implemented.
746-831: Thorough serialization round-trip testing.The configuration serialization tests verify nested dataclass handling and JSON compatibility. This ensures database persistence integrity.
cortex/kernel_features/model_lifecycle.py (5)
64-79: Well-designed dataclass with serialization support.The
HealthCheckConfigdataclass provides sensible defaults and properto_dict/from_dictmethods for persistence. The field filtering infrom_dicthandles forward compatibility gracefully.
173-237: SQLite database implementation is solid.Good use of:
- UTC timestamps for consistency
- Upsert pattern with
ON CONFLICT- Index creation for query performance
- Parameterized queries preventing SQL injection
431-470: Service file generation is well-structured.The generated systemd unit file includes proper sections for resources, security hardening, restart policies, and logging. The template approach is clean and maintainable.
814-820: Subprocess call inlogs()should use explicit argument list.The
subprocess.run()call passesargsdirectly. While the list is constructed internally (safe from injection), thenameparameter should be validated to ensure it doesn't contain malicious characters before being passed to_service_name().
567-577: Validate model names to prevent systemd service naming issues.Model names are passed to
systemctlvia subprocess without validation. Whilesubprocess.runwith a list avoids shell injection, special characters in service names could cause issues with systemd service registration and management.Consider adding name validation in
register()orModelConfigto ensure model names conform to systemd service naming requirements:import re def _validate_name(name: str) -> bool: """Validate model name for safe systemd service naming.""" # Allow alphanumeric, dash, underscore only return bool(re.match(r'^[a-zA-Z0-9][a-zA-Z0-9_-]*$', name))
| def check_health(self, config: ModelConfig) -> Tuple[bool, str]: | ||
| """Perform a single health check.""" | ||
| url = config.get_health_url() | ||
| timeout = config.health_check.timeout_seconds | ||
|
|
||
| try: | ||
| req = urllib.request.Request(url, method='GET') | ||
| with urllib.request.urlopen(req, timeout=timeout) as response: | ||
| if response.status == 200: | ||
| return True, "OK" | ||
| return False, f"HTTP {response.status}" | ||
| except urllib.error.URLError as e: | ||
| return False, f"Connection failed: {e.reason}" | ||
| except Exception as e: | ||
| return False, str(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validate URL scheme before opening to prevent SSRF.
The check_health method opens URLs constructed from user-provided host values without validating the scheme. This could allow file:// or other dangerous schemes if a malicious host value is provided.
def check_health(self, config: ModelConfig) -> Tuple[bool, str]:
"""Perform a single health check."""
url = config.get_health_url()
+ # Validate URL scheme to prevent SSRF
+ if not url.startswith(('http://', 'https://')):
+ return False, f"Invalid URL scheme: {url}"
timeout = config.health_check.timeout_seconds
try:
req = urllib.request.Request(url, method='GET')
with urllib.request.urlopen(req, timeout=timeout) as response:
if response.status == 200:
return True, "OK"
return False, f"HTTP {response.status}"
except urllib.error.URLError as e:
return False, f"Connection failed: {e.reason}"
- except Exception as e:
- return False, str(e)
+ except (TimeoutError, OSError) as e:
+ return False, f"Connection error: {e}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def check_health(self, config: ModelConfig) -> Tuple[bool, str]: | |
| """Perform a single health check.""" | |
| url = config.get_health_url() | |
| timeout = config.health_check.timeout_seconds | |
| try: | |
| req = urllib.request.Request(url, method='GET') | |
| with urllib.request.urlopen(req, timeout=timeout) as response: | |
| if response.status == 200: | |
| return True, "OK" | |
| return False, f"HTTP {response.status}" | |
| except urllib.error.URLError as e: | |
| return False, f"Connection failed: {e.reason}" | |
| except Exception as e: | |
| return False, str(e) | |
| def check_health(self, config: ModelConfig) -> Tuple[bool, str]: | |
| """Perform a single health check.""" | |
| url = config.get_health_url() | |
| # Validate URL scheme to prevent SSRF | |
| if not url.startswith(('http://', 'https://')): | |
| return False, f"Invalid URL scheme: {url}" | |
| timeout = config.health_check.timeout_seconds | |
| try: | |
| req = urllib.request.Request(url, method='GET') | |
| with urllib.request.urlopen(req, timeout=timeout) as response: | |
| if response.status == 200: | |
| return True, "OK" | |
| return False, f"HTTP {response.status}" | |
| except urllib.error.URLError as e: | |
| return False, f"Connection failed: {e.reason}" | |
| except (TimeoutError, OSError) as e: | |
| return False, f"Connection error: {e}" |
🧰 Tools
🪛 Ruff (0.14.7)
492-492: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
493-493: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
499-499: Do not catch blind exception: Exception
(BLE001)
| if args.command == "register": | ||
| config = ModelConfig( | ||
| name=args.name, | ||
| model_path=args.path, | ||
| backend=args.backend, | ||
| port=args.port, | ||
| host=args.host, | ||
| gpu_ids=[int(x) for x in args.gpus.split(",")], | ||
| max_model_len=args.max_model_len, | ||
| tensor_parallel_size=args.tensor_parallel, | ||
| quantization=args.quantization, | ||
| extra_args=args.extra_args, | ||
| health_check=HealthCheckConfig( | ||
| enabled=not args.no_health_check, | ||
| endpoint=ServiceGenerator().get_default_health_endpoint(args.backend) | ||
| ), | ||
| resources=ResourceLimits( | ||
| memory_max=args.memory, | ||
| cpu_quota=args.cpu | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU ID parsing lacks validation.
The GPU ID parsing splits on comma and converts to int without validation. Invalid input like "0,abc,1" will raise an unhandled ValueError.
+def parse_gpu_ids(gpus_str: str) -> List[int]:
+ """Parse comma-separated GPU IDs with validation."""
+ try:
+ ids = [int(x.strip()) for x in gpus_str.split(",") if x.strip()]
+ if not ids:
+ raise ValueError("No valid GPU IDs provided")
+ if any(id < 0 for id in ids):
+ raise ValueError("GPU IDs must be non-negative")
+ return ids
+ except ValueError as e:
+ raise argparse.ArgumentTypeError(f"Invalid GPU IDs '{gpus_str}': {e}")
# In register command setup:
-reg.add_argument("--gpus", default="0", help="Comma-separated GPU IDs")
+reg.add_argument("--gpus", default="0", type=parse_gpu_ids, help="Comma-separated GPU IDs")Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cortex/kernel_features/model_lifecycle.py around lines 930 to 950, the GPU ID
parsing (gpu_ids=[int(x) for x in args.gpus.split(",")]) lacks validation and
will raise on invalid entries; change it to robustly parse by splitting on
commas, stripping whitespace, ignoring empty entries, and validating each token
is a non-negative integer (or raise a clear ValueError with a helpful message),
or alternatively collect only valid ints and log/warn about skipped invalid
tokens; ensure the final gpu_ids is a list of ints and handle the case of an
empty args.gpus string.
| | Backend | Command | Health Endpoint | | ||
| |---------|---------|-----------------|| | ||
| | vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` | | ||
| | llama.cpp | `llama-server` | `/health` | | ||
| | Ollama | `ollama serve` | `/api/tags` | | ||
| | TGI | `text-generation-launcher` | `/health` | | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Malformed Markdown table syntax.
Line 37 has a double pipe || at the end of the header separator row which will break table rendering in most Markdown processors.
-|---------|---------|-----------------||
+|---------|---------|-----------------|📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | Backend | Command | Health Endpoint | | |
| |---------|---------|-----------------|| | |
| | vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` | | |
| | llama.cpp | `llama-server` | `/health` | | |
| | Ollama | `ollama serve` | `/api/tags` | | |
| | TGI | `text-generation-launcher` | `/health` | | |
| | Backend | Command | Health Endpoint | | |
| |---------|---------|-----------------| | |
| | vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` | | |
| | llama.cpp | `llama-server` | `/health` | | |
| | Ollama | `ollama serve` | `/api/tags` | | |
| | TGI | `text-generation-launcher` | `/health` | | |
🤖 Prompt for AI Agents
In README_MODEL_LIFECYCLE.md around lines 36 to 42, the Markdown table header
separator row contains an extra trailing pipe ("||") which breaks table
rendering; remove the extra pipe so the separator row has exactly three columns
(e.g., |---------|---------|-----------------|) and verify all data rows have
three pipe-separated columns to restore correct table formatting.
| - `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines) | ||
| - `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests) | ||
| - `README_MODEL_LIFECYCLE.md` - This documentation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation line counts are outdated.
The test file is described as "~650 lines, 50+ tests" but the actual file has 907 lines and 63 tests according to the PR description. Consider updating to reflect current state.
-- `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests)
+- `tests/test_model_lifecycle.py` - Unit tests (~900 lines, 63 tests)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines) | |
| - `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests) | |
| - `README_MODEL_LIFECYCLE.md` - This documentation | |
| - `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines) | |
| - `tests/test_model_lifecycle.py` - Unit tests (~900 lines, 63 tests) | |
| - `README_MODEL_LIFECYCLE.md` - This documentation |
🤖 Prompt for AI Agents
In README_MODEL_LIFECYCLE.md around lines 291 to 293, the test file line counts
are stale; update the bullet to reflect the current test file size and count by
replacing "~650 lines, 50+ tests" with the current values (907 lines, 63 tests)
so the entry reads "- `tests/test_model_lifecycle.py` - Unit tests (~907 lines,
63 tests)". Ensure formatting and punctuation match the surrounding list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a comprehensive Model Lifecycle Manager that brings systemd-based service management to LLM models in Cortex Linux. The implementation provides a production-ready solution for managing AI models as first-class system services with health monitoring, resource limits, and security hardening.
Key Changes
- Core Implementation: 993-line model lifecycle manager with SQLite persistence, systemd service generation for 4 backends (vLLM, llama.cpp, Ollama, TGI), and HTTP health check monitoring
- Comprehensive Testing: 907-line test suite with 63 unit tests covering all major functionality including configuration, database operations, service generation, and lifecycle management
- Documentation: Complete user guide with usage examples, architecture overview, and configuration reference
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
cortex/kernel_features/model_lifecycle.py |
Main implementation providing ModelConfig dataclasses, SQLite database layer, systemd service file generation, lifecycle operations (register/start/stop/enable), health monitoring with auto-restart, and CLI interface |
tests/test_model_lifecycle.py |
Comprehensive test suite with 63 tests covering dataclass serialization, database CRUD operations, service generation for all backends, lifecycle management, health checking, and edge cases |
README_MODEL_LIFECYCLE.md |
User documentation with quick start guide, command reference, usage examples for all backends, configuration details for resources/security/health checks, and architecture overview |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Color-code state | ||
| state_str = state.value | ||
| if state == ModelState.ACTIVE: | ||
| state_str = f"\033[32m{state_str}\033[0m" # Green | ||
| elif state == ModelState.FAILED: | ||
| state_str = f"\033[31m{state_str}\033[0m" # Red | ||
|
|
||
| print(f"{m.name:<20} {state_str:<21} {enabled:<8} {m.backend:<10} {m.port:<6}") |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The status output uses ANSI color codes directly (lines 808-810) but doesn't check if the output is a TTY. When piping output or redirecting to a file, the color codes will appear as raw escape sequences, making the output harder to read. Consider checking sys.stdout.isatty() before applying colors or using a library like colorama for cross-platform color support.
| ## Supported Backends | ||
|
|
||
| | Backend | Command | Health Endpoint | | ||
| |---------|---------|-----------------|| |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table header format in the documentation (line 37) has an extra pipe character at the end. The correct markdown table format should be:
| Backend | Command | Health Endpoint |
|---------|---------|-----------------|
Without the trailing | on the separator line.
| |---------|---------|-----------------|| | |
| |---------|---------|-----------------| |
| @@ -0,0 +1,907 @@ | |||
| #!/usr/bin/env python3 | |||
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test file location inconsistency: There's already a test file at tests/kernel_features/test_model_lifecycle.py (15 lines), but this PR creates a new test file at tests/test_model_lifecycle.py (907 lines). This creates duplicate test files with the same name in different locations, which could cause confusion. Consider consolidating these tests or clarifying the intended structure.
| if self._failure_counts[name] >= hc.max_failures: | ||
| self.manager.db.log_event(name, EventType.AUTO_RESTARTED) | ||
| self.manager.restart(name, log_event=False) | ||
| self._failure_counts[name] = 0 | ||
| time.sleep(hc.startup_delay_seconds) |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage: The health monitoring auto-restart functionality (lines 534-538) is a critical feature but lacks comprehensive integration testing. While there are unit tests for individual health check methods, there's no test verifying the complete flow: health check fails → failure count increases → max failures reached → auto-restart triggered. Consider adding an integration test for this critical path.
| """Generate health check watchdog configuration.""" | ||
| if not config.health_check.enabled: | ||
| return "" | ||
|
|
||
| hc = config.health_check | ||
| # Use systemd watchdog for health monitoring | ||
| return f""" | ||
| # Health check via systemd watchdog | ||
| WatchdogSec={hc.interval_seconds} | ||
| """ | ||
|
|
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The watchdog health check configuration is generated but systemd's WatchdogSec requires the service to actively send watchdog signals (via sd_notify). The current implementation doesn't send these signals, so this watchdog configuration won't actually monitor health. Either implement proper sd_notify integration or remove this misleading configuration and rely solely on the Python-based health checker.
| """Generate health check watchdog configuration.""" | |
| if not config.health_check.enabled: | |
| return "" | |
| hc = config.health_check | |
| # Use systemd watchdog for health monitoring | |
| return f""" | |
| # Health check via systemd watchdog | |
| WatchdogSec={hc.interval_seconds} | |
| """ | |
| """Health monitoring is handled by Python, not systemd watchdog.""" | |
| return "" |
| except Exception as e: | ||
| self.db.log_event(config.name, EventType.ERROR, str(e)) | ||
| print(f"Failed to register '{config.name}': {e}") | ||
| return False |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling in the register method catches all exceptions generically (line 602), which makes debugging more difficult. The same pattern is used in unregister (line 635). Consider catching specific exception types (e.g., IOError, OSError, subprocess.CalledProcessError) separately to provide more informative error messages, or at least include the exception type in the log message.
| if follow: | ||
| args.append("-f") | ||
|
|
||
| subprocess.run(args) |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logs method (lines 814-820) directly calls subprocess.run() without handling potential exceptions. If journalctl is not available or the command fails, this will raise an unhandled exception. Consider adding error handling and a user-friendly error message.
| subprocess.run(args) | |
| try: | |
| subprocess.run(args, check=True) | |
| except FileNotFoundError: | |
| print("Error: 'journalctl' command not found. Please ensure systemd and journalctl are installed.") | |
| except subprocess.CalledProcessError as e: | |
| print(f"Error: Failed to retrieve logs for '{name}'.\nDetails: {e}") |
| - CLI parsing | ||
| """ | ||
|
|
||
| import os |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'os' is not used.
| import os |
| import unittest | ||
| from pathlib import Path | ||
| from unittest.mock import patch, MagicMock | ||
| from dataclasses import asdict |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'asdict' is not used.
| from dataclasses import asdict |
| ModelDatabase, | ||
| ServiceGenerator, | ||
| ModelLifecycleManager, | ||
| HealthChecker, |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import of 'HealthChecker' is not used.
| HealthChecker, |


Summary
Complete implementation of issue #220 - Model Lifecycle Manager ($150 bounty).
Brings "systemctl for AI models" to Cortex Linux with:
Usage
Files Changed
cortex/kernel_features/model_lifecycle.pytests/test_model_lifecycle.pyREADME_MODEL_LIFECYCLE.mdTest Results
Acceptance Criteria
cortex model registercreates valid systemd servicecortex model start/stopworks reliablyCloses #220
Summary by CodeRabbit
New Features
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.