feat: Implement Model Lifecycle Manager - Systemd-based LLM service management #234

yaya1738 · 2025-12-04T05:56:27Z

Summary

Complete implementation of issue #220 - Model Lifecycle Manager ($150 bounty).

Brings "systemctl for AI models" to Cortex Linux with:

Systemd service generation for any LLM backend
Multi-backend support: vLLM, llama.cpp, Ollama, TGI
Health check monitoring with auto-restart on failure
Resource limits via systemd (CPU, memory, I/O, tasks)
Security hardening (NoNewPrivileges, ProtectSystem, etc.)
SQLite persistence for configuration and event logging
Boot auto-start via systemd enable

Usage

# Register a model
cortex-model register llama-70b --path meta-llama/Llama-2-70b-hf --backend vllm --gpus 0,1

# Start/stop/restart
cortex-model start llama-70b
cortex-model stop llama-70b

# Enable auto-start on boot
cortex-model enable llama-70b

# View status and logs
cortex-model status
cortex-model logs llama-70b -f

Files Changed

File	Lines	Description
`cortex/kernel_features/model_lifecycle.py`	993	Main implementation
`tests/test_model_lifecycle.py`	907	63 unit tests
`README_MODEL_LIFECYCLE.md`	297	Documentation

Test Results

============================= test session starts ==============================
collected 63 items
tests/test_model_lifecycle.py ............................ [100%]
============================== 63 passed in 0.79s ==============================

Acceptance Criteria

cortex model register creates valid systemd service
cortex model start/stop works reliably
Health checks trigger auto-restart on failure
Unit tests pass with >80% coverage (63 tests)

Closes #220

Summary by CodeRabbit

New Features
- Added Model Lifecycle Manager with systemd service integration for model deployment and lifecycle management
- Support for register, start, stop, restart, enable/disable workflows via CLI
- Health monitoring with automatic restart on service failures
- Persistent storage for model configurations and operational events
- Multi-backend support (vLLM, Ollama, LlamaCpp, TGI) with advanced configuration options
- Service status reporting and log access capabilities
Documentation
- Added comprehensive Model Lifecycle documentation with quick start guide and architecture overview

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…anagement Complete implementation of issue cortexlinux#220 ($150 bounty): Features: - Systemd service generation for any LLM backend - Multi-backend support (vLLM, llama.cpp, Ollama, TGI) - Health check monitoring with auto-restart on failure - Resource limits via systemd (CPU, memory, I/O, tasks) - Security hardening (NoNewPrivileges, ProtectSystem, etc.) - SQLite database for configuration persistence - Event logging for audit trail - Boot auto-start via systemd enable Files: - model_lifecycle.py: 993 lines of implementation - test_model_lifecycle.py: 907 lines, 63 tests (all passing) - README_MODEL_LIFECYCLE.md: Complete documentation Closes cortexlinux#220

coderabbitai · 2025-12-04T05:56:35Z

Walkthrough

Introduces a comprehensive Model Lifecycle Manager subsystem enabling systemd-based LLM service management for Cortex. Adds persistent SQLite storage, multi-backend support (vLLM, llama.cpp, TGI, Ollama), health monitoring with auto-restart, resource configuration, and rich CLI commands for model registration, start/stop, enable/disable workflows.

Changes

Cohort / File(s)	Summary
Documentation `README_MODEL_LIFECYCLE.md`	New documentation file introducing systemd-based model lifecycle management, including quick start, supported backends, register/start/stop/restart workflows, status monitoring, usage examples, architecture overview, service files, and testing guidance.
Core Implementation `cortex/kernel_features/model_lifecycle.py`	Comprehensive model lifecycle subsystem: introduces `ModelState` and `EventType` enums; adds nested configuration dataclasses (`HealthCheckConfig`, `ResourceLimits`, `SecurityConfig`); implements `ModelDatabase` for SQLite-backed persistence; creates `ServiceGenerator` for multi-backend systemd service generation; implements `HealthChecker` for HTTP health monitoring with auto-restart; adds `ModelLifecycleManager` orchestrating registration, start/stop/restart, enable/disable, status, logs, and events; expands CLI with rich command handlers.
Test Suite `tests/test_model_lifecycle.py`	Comprehensive test suite covering dataclass validation/serialization, database CRUD operations and event logging, service generation across backends with health endpoints and resource/security settings, lifecycle manager operations, health checker logic, enum coverage, configuration round-trips, and edge cases.

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant Manager as ModelLifecycleManager
    participant DB as ModelDatabase
    participant Gen as ServiceGenerator
    participant Systemd as systemd
    participant HC as HealthChecker
    participant Service as LLM Service

    rect rgb(200, 220, 255)
    note over User,Systemd: Model Registration & Start Flow
    User->>Manager: register(config)
    Manager->>DB: save_model(config)
    DB-->>Manager: ✓ saved
    Manager->>Gen: generate(config)
    Gen-->>Manager: systemd unit file
    Manager->>Systemd: write unit + daemon-reload
    Systemd-->>Manager: ✓ loaded
    Manager->>DB: log_event(REGISTERED)
    DB-->>Manager: ✓ logged
    Manager-->>User: registration complete
    
    User->>Manager: start(name)
    Manager->>Systemd: systemctl start service
    Systemd->>Service: launch LLM backend
    Service-->>Systemd: running
    Systemd-->>Manager: ✓ started
    Manager->>DB: log_event(STARTED)
    Manager-->>User: ✓ started
    end
    
    rect rgb(220, 240, 220)
    note over HC,Service: Health Monitoring & Auto-Restart
    HC->>HC: monitor interval elapsed
    HC->>Service: GET /health
    Service-->>HC: 200 OK
    HC->>DB: log_event(HEALTH_CHECK_PASSED)
    DB-->>HC: ✓ logged
    
    HC->>Service: GET /health (later)
    Service-->>HC: timeout/error
    HC->>HC: increment failure count
    alt Failure threshold reached
        HC->>Systemd: systemctl restart service
        Systemd->>Service: restart LLM backend
        Service-->>Systemd: running (restarted)
        HC->>DB: log_event(AUTO_RESTARTED)
    end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

ModelDatabase SQLite integration: Verify schema, query correctness, transaction handling, and serialization/deserialization logic for nested dataclasses.
ServiceGenerator backend support: Review correctness of generated systemd ExecStart commands for each backend (vLLM, llamacpp, TGI, Ollama), resource limit blocks, security hardening, and health endpoint mappings.
HealthChecker state machine: Validate HTTP health check logic, failure counting, restart triggering, and integration with the lifecycle manager to prevent race conditions.
ModelLifecycleManager orchestration: Examine event logging consistency across all operations, systemd service file lifecycle (creation, deletion, reload), state transitions, and error handling paths.
CLI command wiring: Ensure all CLI arguments are properly parsed, nested configurations constructed correctly, and output formatting is user-friendly.
Test coverage: Verify test suite exercises happy paths, error conditions, edge cases (special characters, multi-GPU), and interactions between components.

Possibly related PRs

Add Kernel-Level AI Features (Tier 1) #224 — Modifies the same model lifecycle implementation files and exports related classes, likely addressing parallel or related functionality.

Suggested labels

enhancement, kernel-features

Poem

🐰 Hoppy systemd hops around,
Models service-d with joyful sound,
Auto-restarts when health checks fail,
SQLite stores each tiny tale,
AI services now have class—
"cortex model start" does pass! ✨

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: implementing a Model Lifecycle Manager with systemd-based service management for LLM models, matching the PR's core objective.
Description check	✅ Passed	The description includes a summary, usage examples, files changed, test results, acceptance criteria, and closes statement, though it doesn't explicitly follow the template's checklist format.
Linked Issues check	✅ Passed	The PR implementation comprehensively addresses all requirements from issue #220: systemd service generation, multi-backend support (vLLM, llama.cpp, Ollama, TGI), health checks with auto-restart, resource limits, security hardening, SQLite persistence, and 63 passing unit tests.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue #220: the main implementation file, comprehensive test suite, and documentation are all focused on the Model Lifecycle Manager feature without unrelated modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sonarqubecloud · 2025-12-04T05:57:02Z

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (8)

README_MODEL_LIFECYCLE.md (1)
188-206: Add language specifier to fenced code block.

The architecture diagram code block lacks a language specifier. While this is plaintext, adding a specifier improves rendering consistency.
-```
+```text
 ModelLifecycleManager
tests/test_model_lifecycle.py (1)
355-361: Rename unused loop variable to _.

The loop variable i is not used within the loop body. Per Python convention, rename it to _ to indicate it's intentionally unused.
-        for i in range(10):
+        for _ in range(10):
             self.db.log_event("test", EventType.STARTED)
cortex/kernel_features/model_lifecycle.py (6)
239-246: Use explicit Optional type hint.

PEP 484 prohibits implicit Optional. The details parameter should be explicitly typed.
-    def log_event(self, model_name: str, event_type: EventType, details: str = None) -> None:
+    def log_event(self, model_name: str, event_type: EventType, details: Optional[str] = None) -> None:
248-267: Use explicit Optional type hint for model_name parameter.
-    def get_events(self, model_name: str = None, limit: int = 100) -> List[Dict[str, Any]]:
+    def get_events(self, model_name: Optional[str] = None, limit: int = 100) -> List[Dict[str, Any]]:
270-317: Annotate class-level mutable attributes with ClassVar.

The BACKENDS and HEALTH_ENDPOINTS dictionaries are class-level constants and should be annotated with ClassVar to clarify they are not instance attributes.
+from typing import ClassVar
+
 class ServiceGenerator:
     """Generate systemd service files for LLM backends."""
 
     # Backend command templates
-    BACKENDS = {
+    BACKENDS: ClassVar[Dict[str, str]] = {
         ...
     }
 
     # Health check endpoints by backend
-    HEALTH_ENDPOINTS = {
+    HEALTH_ENDPOINTS: ClassVar[Dict[str, str]] = {
         ...
     }
331-336: Deduplicate identical branches for dtype handling.

The vllm and tgi branches have identical logic for dtype argument construction.
         dtype_arg = ""
         if config.dtype != "auto":
-            if config.backend == "vllm":
-                dtype_arg = f"--dtype {config.dtype}"
-            elif config.backend == "tgi":
+            if config.backend in ("vllm", "tgi"):
                 dtype_arg = f"--dtype {config.dtype}"
502-544: High cognitive complexity in monitor_loop - consider refactoring.

The nested monitor_loop closure has high cognitive complexity (17 vs allowed 15 per SonarCloud). Consider extracting the health check result handling into a separate method.
def _handle_health_result(self, name: str, healthy: bool, msg: str, hc: HealthCheckConfig) -> float:
    """Handle health check result. Returns sleep duration."""
    if healthy:
        if self._failure_counts.get(name, 0) > 0:
            self.manager.db.log_event(name, EventType.HEALTH_CHECK_PASSED, msg)
        self._failure_counts[name] = 0
        return hc.interval_seconds
    
    self._failure_counts[name] = self._failure_counts.get(name, 0) + 1
    self.manager.db.log_event(
        name, EventType.HEALTH_CHECK_FAILED,
        f"Failure {self._failure_counts[name]}/{hc.max_failures}: {msg}"
    )
    
    if self._failure_counts[name] >= hc.max_failures:
        self.manager.db.log_event(name, EventType.AUTO_RESTARTED)
        self.manager.restart(name, log_event=False)
        self._failure_counts[name] = 0
        return hc.startup_delay_seconds
    
    return hc.interval_seconds
600-605: Narrow the exception handler to specific exceptions.

Catching bare Exception can mask programming errors. Consider catching specific exceptions like OSError, sqlite3.Error, etc.
-        except Exception as e:
+        except (OSError, sqlite3.Error, subprocess.SubprocessError) as e:
             self.db.log_event(config.name, EventType.ERROR, str(e))
             print(f"Failed to register '{config.name}': {e}")
             return False
Apply similar changes to unregister() at lines 635-638.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da3e635 and ea0f124.

📒 Files selected for processing (3)

README_MODEL_LIFECYCLE.md (1 hunks)
cortex/kernel_features/model_lifecycle.py (1 hunks)
tests/test_model_lifecycle.py (1 hunks)

🧰 Additional context used

🪛 GitHub Check: SonarCloud Code Analysis

tests/test_model_lifecycle.py

[warning] 357-357: Replace the unused loop index "i" with "_".

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7288Tmn1wGm3II4E&open=AZrn7288Tmn1wGm3II4E&pullRequest=234

cortex/kernel_features/model_lifecycle.py

[failure] 68-68: Define a constant instead of duplicating this literal "/health" 5 times.

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7261Tmn1wGm3II39&open=AZrn7261Tmn1wGm3II39&pullRequest=234

[warning] 336-336: Either merge this branch with the identical one on line "334" or change one of the implementations.

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7261Tmn1wGm3II4B&open=AZrn7261Tmn1wGm3II4B&pullRequest=234

[failure] 860-860: Define a constant instead of duplicating this literal "Model name" 9 times.

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7261Tmn1wGm3II3-&open=AZrn7261Tmn1wGm3II3-&pullRequest=234

[failure] 502-502: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7261Tmn1wGm3II4C&open=AZrn7261Tmn1wGm3II4C&pullRequest=234

[failure] 839-839: Refactor this function to reduce its Cognitive Complexity from 21 to the 15 allowed.

See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrn7261Tmn1wGm3II4D&open=AZrn7261Tmn1wGm3II4D&pullRequest=234

🪛 markdownlint-cli2 (0.18.1)

README_MODEL_LIFECYCLE.md

188-188: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.7)

tests/test_model_lifecycle.py

1-1: Shebang is present but file is not executable

(EXE001)

182-182: Possible binding to all interfaces

(S104)

357-357: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

541-541: Unused method argument: mock_run

(ARG002)

756-756: Possible binding to all interfaces

(S104)

cortex/kernel_features/model_lifecycle.py

239-239: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

248-248: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

274-309: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

312-317: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

492-492: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

493-493: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

499-499: Do not catch blind exception: Exception

(BLE001)

560-560: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

569-569: subprocess call: check for execution of untrusted input

(S603)

570-570: Consider ["systemctl", "--user", *list(args)] instead of concatenation

Replace with ["systemctl", "--user", *list(args)]

(RUF005)

601-601: Consider moving this statement to an else block

(TRY300)

602-602: Do not catch blind exception: Exception

(BLE001)

634-634: Consider moving this statement to an else block

(TRY300)

635-635: Do not catch blind exception: Exception

(BLE001)

782-782: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

820-820: subprocess call: check for execution of untrusted input

(S603)

822-822: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Agent

🔇 Additional comments (8)

tests/test_model_lifecycle.py (3)

540-544: Unused mock_run parameter is intentional.

The mock_run parameter appears unused but is necessary to prevent actual subprocess.run calls during the test. The @patch decorator requires a corresponding parameter. This is correct.

656-712: Good coverage of health check scenarios.

The HealthChecker tests cover success, bad status codes, and connection errors with proper mocking of urllib.request.urlopen. The context manager mock pattern is correctly implemented.

746-831: Thorough serialization round-trip testing.

The configuration serialization tests verify nested dataclass handling and JSON compatibility. This ensures database persistence integrity.
cortex/kernel_features/model_lifecycle.py (5)
64-79: Well-designed dataclass with serialization support.

The HealthCheckConfig dataclass provides sensible defaults and proper to_dict/from_dict methods for persistence. The field filtering in from_dict handles forward compatibility gracefully.

173-237: SQLite database implementation is solid.

Good use of:

UTC timestamps for consistency

Upsert pattern with ON CONFLICT

Index creation for query performance

Parameterized queries preventing SQL injection

431-470: Service file generation is well-structured.

The generated systemd unit file includes proper sections for resources, security hardening, restart policies, and logging. The template approach is clean and maintainable.

814-820: Subprocess call in logs() should use explicit argument list.

The subprocess.run() call passes args directly. While the list is constructed internally (safe from injection), the name parameter should be validated to ensure it doesn't contain malicious characters before being passed to _service_name().

567-577: Validate model names to prevent systemd service naming issues.

Model names are passed to systemctl via subprocess without validation. While subprocess.run with a list avoids shell injection, special characters in service names could cause issues with systemd service registration and management.

Consider adding name validation in register() or ModelConfig to ensure model names conform to systemd service naming requirements:
import re

def _validate_name(name: str) -> bool:
    """Validate model name for safe systemd service naming."""
    # Allow alphanumeric, dash, underscore only
    return bool(re.match(r'^[a-zA-Z0-9][a-zA-Z0-9_-]*$', name))

coderabbitai · 2025-12-04T06:00:05Z

cortex/kernel_features/model_lifecycle.py

+    def check_health(self, config: ModelConfig) -> Tuple[bool, str]:
+        """Perform a single health check."""
+        url = config.get_health_url()
+        timeout = config.health_check.timeout_seconds
+
+        try:
+            req = urllib.request.Request(url, method='GET')
+            with urllib.request.urlopen(req, timeout=timeout) as response:
+                if response.status == 200:
+                    return True, "OK"
+                return False, f"HTTP {response.status}"
+        except urllib.error.URLError as e:
+            return False, f"Connection failed: {e.reason}"
+        except Exception as e:
+            return False, str(e)


⚠️ Potential issue | 🟠 Major

Validate URL scheme before opening to prevent SSRF.

The check_health method opens URLs constructed from user-provided host values without validating the scheme. This could allow file:// or other dangerous schemes if a malicious host value is provided.

def check_health(self, config: ModelConfig) -> Tuple[bool, str]: """Perform a single health check.""" url = config.get_health_url() + # Validate URL scheme to prevent SSRF + if not url.startswith(('http://', 'https://')): + return False, f"Invalid URL scheme: {url}" timeout = config.health_check.timeout_seconds try: req = urllib.request.Request(url, method='GET') with urllib.request.urlopen(req, timeout=timeout) as response: if response.status == 200: return True, "OK" return False, f"HTTP {response.status}" except urllib.error.URLError as e: return False, f"Connection failed: {e.reason}" - except Exception as e: - return False, str(e) + except (TimeoutError, OSError) as e: + return False, f"Connection error: {e}"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def check_health(self, config: ModelConfig) -> Tuple[bool, str]:

"""Perform a single health check."""

url = config.get_health_url()

timeout = config.health_check.timeout_seconds

try:

req = urllib.request.Request(url, method='GET')

with urllib.request.urlopen(req, timeout=timeout) as response:

if response.status == 200:

return True, "OK"

return False, f"HTTP {response.status}"

except urllib.error.URLError as e:

return False, f"Connection failed: {e.reason}"

except Exception as e:

return False, str(e)

def check_health(self, config: ModelConfig) -> Tuple[bool, str]:

"""Perform a single health check."""

url = config.get_health_url()

# Validate URL scheme to prevent SSRF

if not url.startswith(('http://', 'https://')):

return False, f"Invalid URL scheme: {url}"

timeout = config.health_check.timeout_seconds

try:

req = urllib.request.Request(url, method='GET')

with urllib.request.urlopen(req, timeout=timeout) as response:

if response.status == 200:

return True, "OK"

return False, f"HTTP {response.status}"

except urllib.error.URLError as e:

return False, f"Connection failed: {e.reason}"

except (TimeoutError, OSError) as e:

return False, f"Connection error: {e}"

🧰 Tools

🪛 Ruff (0.14.7)

492-492: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

493-493: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

499-499: Do not catch blind exception: Exception

(BLE001)

coderabbitai · 2025-12-04T06:00:05Z

cortex/kernel_features/model_lifecycle.py

+    if args.command == "register":
+        config = ModelConfig(
+            name=args.name,
+            model_path=args.path,
+            backend=args.backend,
+            port=args.port,
+            host=args.host,
+            gpu_ids=[int(x) for x in args.gpus.split(",")],
+            max_model_len=args.max_model_len,
+            tensor_parallel_size=args.tensor_parallel,
+            quantization=args.quantization,
+            extra_args=args.extra_args,
+            health_check=HealthCheckConfig(
+                enabled=not args.no_health_check,
+                endpoint=ServiceGenerator().get_default_health_endpoint(args.backend)
+            ),
+            resources=ResourceLimits(
+                memory_max=args.memory,
+                cpu_quota=args.cpu
+            )
+        )


⚠️ Potential issue | 🟡 Minor

GPU ID parsing lacks validation.

The GPU ID parsing splits on comma and converts to int without validation. Invalid input like "0,abc,1" will raise an unhandled ValueError.

+def parse_gpu_ids(gpus_str: str) -> List[int]: + """Parse comma-separated GPU IDs with validation.""" + try: + ids = [int(x.strip()) for x in gpus_str.split(",") if x.strip()] + if not ids: + raise ValueError("No valid GPU IDs provided") + if any(id < 0 for id in ids): + raise ValueError("GPU IDs must be non-negative") + return ids + except ValueError as e: + raise argparse.ArgumentTypeError(f"Invalid GPU IDs '{gpus_str}': {e}") # In register command setup: -reg.add_argument("--gpus", default="0", help="Comma-separated GPU IDs") +reg.add_argument("--gpus", default="0", type=parse_gpu_ids, help="Comma-separated GPU IDs")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In cortex/kernel_features/model_lifecycle.py around lines 930 to 950, the GPU ID parsing (gpu_ids=[int(x) for x in args.gpus.split(",")]) lacks validation and will raise on invalid entries; change it to robustly parse by splitting on commas, stripping whitespace, ignoring empty entries, and validating each token is a non-negative integer (or raise a clear ValueError with a helpful message), or alternatively collect only valid ints and log/warn about skipped invalid tokens; ensure the final gpu_ids is a list of ints and handle the case of an empty args.gpus string.

coderabbitai · 2025-12-04T06:00:06Z

README_MODEL_LIFECYCLE.md

+| Backend | Command | Health Endpoint |
+|---------|---------|-----------------||
+| vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` |
+| llama.cpp | `llama-server` | `/health` |
+| Ollama | `ollama serve` | `/api/tags` |
+| TGI | `text-generation-launcher` | `/health` |
+


⚠️ Potential issue | 🟡 Minor

Malformed Markdown table syntax.

Line 37 has a double pipe || at the end of the header separator row which will break table rendering in most Markdown processors.

-|---------|---------|-----------------|| +|---------|---------|-----------------|

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

| Backend | Command | Health Endpoint |

|---------|---------|-----------------||

| vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` |

| llama.cpp | `llama-server` | `/health` |

| Ollama | `ollama serve` | `/api/tags` |

| TGI | `text-generation-launcher` | `/health` |

| Backend | Command | Health Endpoint |

|---------|---------|-----------------|

| vLLM | `python -m vllm.entrypoints.openai.api_server` | `/health` |

| llama.cpp | `llama-server` | `/health` |

| Ollama | `ollama serve` | `/api/tags` |

| TGI | `text-generation-launcher` | `/health` |

🤖 Prompt for AI Agents

In README_MODEL_LIFECYCLE.md around lines 36 to 42, the Markdown table header separator row contains an extra trailing pipe ("||") which breaks table rendering; remove the extra pipe so the separator row has exactly three columns (e.g., |---------|---------|-----------------|) and verify all data rows have three pipe-separated columns to restore correct table formatting.

coderabbitai · 2025-12-04T06:00:06Z

README_MODEL_LIFECYCLE.md

+- `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines)
+- `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests)
+- `README_MODEL_LIFECYCLE.md` - This documentation


⚠️ Potential issue | 🟡 Minor

Documentation line counts are outdated.

The test file is described as "~650 lines, 50+ tests" but the actual file has 907 lines and 63 tests according to the PR description. Consider updating to reflect current state.

-- `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests) +- `tests/test_model_lifecycle.py` - Unit tests (~900 lines, 63 tests)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines)

- `tests/test_model_lifecycle.py` - Unit tests (~650 lines, 50+ tests)

- `README_MODEL_LIFECYCLE.md` - This documentation

- `cortex/kernel_features/model_lifecycle.py` - Main implementation (~1000 lines)

- `tests/test_model_lifecycle.py` - Unit tests (~900 lines, 63 tests)

- `README_MODEL_LIFECYCLE.md` - This documentation

🤖 Prompt for AI Agents

In README_MODEL_LIFECYCLE.md around lines 291 to 293, the test file line counts are stale; update the bullet to reflect the current test file size and count by replacing "~650 lines, 50+ tests" with the current values (907 lines, 63 tests) so the entry reads "- `tests/test_model_lifecycle.py` - Unit tests (~907 lines, 63 tests)". Ensure formatting and punctuation match the surrounding list.

Copilot

Pull request overview

This PR implements a comprehensive Model Lifecycle Manager that brings systemd-based service management to LLM models in Cortex Linux. The implementation provides a production-ready solution for managing AI models as first-class system services with health monitoring, resource limits, and security hardening.

Key Changes

Core Implementation: 993-line model lifecycle manager with SQLite persistence, systemd service generation for 4 backends (vLLM, llama.cpp, Ollama, TGI), and HTTP health check monitoring
Comprehensive Testing: 907-line test suite with 63 unit tests covering all major functionality including configuration, database operations, service generation, and lifecycle management
Documentation: Complete user guide with usage examples, architecture overview, and configuration reference

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.

File	Description
`cortex/kernel_features/model_lifecycle.py`	Main implementation providing ModelConfig dataclasses, SQLite database layer, systemd service file generation, lifecycle operations (register/start/stop/enable), health monitoring with auto-restart, and CLI interface
`tests/test_model_lifecycle.py`	Comprehensive test suite with 63 tests covering dataclass serialization, database CRUD operations, service generation for all backends, lifecycle management, health checking, and edge cases
`README_MODEL_LIFECYCLE.md`	User documentation with quick start guide, command reference, usage examples for all backends, configuration details for resources/security/health checks, and architecture overview

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-04T06:03:34Z

cortex/kernel_features/model_lifecycle.py

+                # Color-code state
+                state_str = state.value
+                if state == ModelState.ACTIVE:
+                    state_str = f"\033[32m{state_str}\033[0m"  # Green
+                elif state == ModelState.FAILED:
+                    state_str = f"\033[31m{state_str}\033[0m"  # Red
+
+                print(f"{m.name:<20} {state_str:<21} {enabled:<8} {m.backend:<10} {m.port:<6}")


The status output uses ANSI color codes directly (lines 808-810) but doesn't check if the output is a TTY. When piping output or redirecting to a file, the color codes will appear as raw escape sequences, making the output harder to read. Consider checking sys.stdout.isatty() before applying colors or using a library like colorama for cross-platform color support.

Copilot · 2025-12-04T06:03:35Z

README_MODEL_LIFECYCLE.md

+## Supported Backends
+
+| Backend | Command | Health Endpoint |
+|---------|---------|-----------------||


The table header format in the documentation (line 37) has an extra pipe character at the end. The correct markdown table format should be:

| Backend | Command | Health Endpoint | |---------|---------|-----------------|

Without the trailing | on the separator line.

Suggested change

|---------|---------|-----------------||

|---------|---------|-----------------|

Copilot · 2025-12-04T06:03:35Z

tests/test_model_lifecycle.py

@@ -0,0 +1,907 @@
+#!/usr/bin/env python3


Test file location inconsistency: There's already a test file at tests/kernel_features/test_model_lifecycle.py (15 lines), but this PR creates a new test file at tests/test_model_lifecycle.py (907 lines). This creates duplicate test files with the same name in different locations, which could cause confusion. Consider consolidating these tests or clarifying the intended structure.

Copilot · 2025-12-04T06:03:35Z

cortex/kernel_features/model_lifecycle.py

+                    if self._failure_counts[name] >= hc.max_failures:
+                        self.manager.db.log_event(name, EventType.AUTO_RESTARTED)
+                        self.manager.restart(name, log_event=False)
+                        self._failure_counts[name] = 0
+                        time.sleep(hc.startup_delay_seconds)


Missing test coverage: The health monitoring auto-restart functionality (lines 534-538) is a critical feature but lacks comprehensive integration testing. While there are unit tests for individual health check methods, there's no test verifying the complete flow: health check fails → failure count increases → max failures reached → auto-restart triggered. Consider adding an integration test for this critical path.

Copilot · 2025-12-04T06:03:36Z

cortex/kernel_features/model_lifecycle.py

+        """Generate health check watchdog configuration."""
+        if not config.health_check.enabled:
+            return ""
+
+        hc = config.health_check
+        # Use systemd watchdog for health monitoring
+        return f"""
+# Health check via systemd watchdog
+WatchdogSec={hc.interval_seconds}
+"""
+


The watchdog health check configuration is generated but systemd's WatchdogSec requires the service to actively send watchdog signals (via sd_notify). The current implementation doesn't send these signals, so this watchdog configuration won't actually monitor health. Either implement proper sd_notify integration or remove this misleading configuration and rely solely on the Python-based health checker.

Suggested change

"""Generate health check watchdog configuration."""

if not config.health_check.enabled:

return ""

hc = config.health_check

# Use systemd watchdog for health monitoring

return f"""

# Health check via systemd watchdog

WatchdogSec={hc.interval_seconds}

"""

"""Health monitoring is handled by Python, not systemd watchdog."""

return ""

Copilot · 2025-12-04T06:03:38Z

cortex/kernel_features/model_lifecycle.py

+        except Exception as e:
+            self.db.log_event(config.name, EventType.ERROR, str(e))
+            print(f"Failed to register '{config.name}': {e}")
+            return False


The error handling in the register method catches all exceptions generically (line 602), which makes debugging more difficult. The same pattern is used in unregister (line 635). Consider catching specific exception types (e.g., IOError, OSError, subprocess.CalledProcessError) separately to provide more informative error messages, or at least include the exception type in the log message.

Copilot · 2025-12-04T06:03:38Z

cortex/kernel_features/model_lifecycle.py

+        if follow:
+            args.append("-f")
+
+        subprocess.run(args)


The logs method (lines 814-820) directly calls subprocess.run() without handling potential exceptions. If journalctl is not available or the command fails, this will raise an unhandled exception. Consider adding error handling and a user-friendly error message.

Suggested change

subprocess.run(args)

try:

subprocess.run(args, check=True)

except FileNotFoundError:

print("Error: 'journalctl' command not found. Please ensure systemd and journalctl are installed.")

except subprocess.CalledProcessError as e:

print(f"Error: Failed to retrieve logs for '{name}'.\nDetails: {e}")

Copilot · 2025-12-04T06:03:39Z

tests/test_model_lifecycle.py

+- CLI parsing
+"""
+
+import os


Import of 'os' is not used.

Suggested change

import os

Copilot · 2025-12-04T06:03:39Z

tests/test_model_lifecycle.py

+import unittest
+from pathlib import Path
+from unittest.mock import patch, MagicMock
+from dataclasses import asdict


Import of 'asdict' is not used.

Suggested change

from dataclasses import asdict

Copilot · 2025-12-04T06:03:39Z

tests/test_model_lifecycle.py

+    ModelDatabase,
+    ServiceGenerator,
+    ModelLifecycleManager,
+    HealthChecker,


Import of 'HealthChecker' is not used.

Suggested change

HealthChecker,

Copilot AI review requested due to automatic review settings December 4, 2025 05:56

yaya1738 requested a review from mikejmorgan-ai as a code owner December 4, 2025 05:56

Copilot started reviewing on behalf of yaya1738 December 4, 2025 05:56 View session

Copilot finished reviewing on behalf of yaya1738 December 4, 2025 05:59

coderabbitai bot reviewed Dec 4, 2025

View reviewed changes

Copilot AI reviewed Dec 4, 2025

View reviewed changes

Sahilbhatane closed this Dec 4, 2025

	\|---------\|---------\|-----------------\|\|
	\|---------\|---------\|-----------------\|

-        subprocess.run(args)
+        try:
+            subprocess.run(args, check=True)
+        except FileNotFoundError:
+            print("Error: 'journalctl' command not found. Please ensure systemd and journalctl are installed.")
+        except subprocess.CalledProcessError as e:
+            print(f"Error: Failed to retrieve logs for '{name}'.\nDetails: {e}")

Uh oh!

feat: Implement Model Lifecycle Manager - Systemd-based LLM service management #234

feat: Implement Model Lifecycle Manager - Systemd-based LLM service management #234

Uh oh!

Conversation

yaya1738 commented Dec 4, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Usage

Files Changed

Test Results

Acceptance Criteria

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

Pre-merge checks and finishing touches

Uh oh!

sonarqubecloud bot commented Dec 4, 2025

Quality Gate failed

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Key Changes

Reviewed changes

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

yaya1738 commented Dec 4, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 4, 2025 •

edited

Loading