Conversation


@yaya1738 yaya1738 commented Dec 4, 2025

Summary

Completes the implementation of accelerator-aware resource limits for AI workloads.

Closes #222

Changes

  • Full cgroups v2 integration with unified hierarchy support
  • Workload presets (inference, training, batch, interactive)
  • CPU quota/weight controls with proper cgroup interface
  • Memory hard (max) and soft (high) limits
  • GPU isolation via CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, ONEAPI_DEVICE_SELECTOR
  • OOM score adjustment for process priority protection
  • User mode delegation support for non-root usage
  • SQLite-based profile storage
  • Complete CLI with create/apply/env/status/list/delete/presets commands
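
The GPU isolation bullet above works by restricting which devices each vendor's runtime can see. As a rough illustrative sketch (not the PR's actual code; the `level_zero` backend prefix for oneAPI is an assumption), generating those variables from a list of device IDs could look like:

```python
# Illustrative sketch of vendor-specific GPU visibility variables.
# Not the PR's implementation; the oneAPI backend prefix is an assumption.
def gpu_isolation_env(gpu_ids):
    """Build per-vendor device-visibility variables for the given GPU indices."""
    ids = ",".join(str(i) for i in gpu_ids)
    return {
        "CUDA_VISIBLE_DEVICES": ids,                    # NVIDIA CUDA
        "HIP_VISIBLE_DEVICES": ids,                     # AMD ROCm
        "ONEAPI_DEVICE_SELECTOR": f"level_zero:{ids}",  # Intel oneAPI
    }

print(gpu_isolation_env([0, 1]))
```

Each runtime only enumerates the listed devices, so the process cannot touch GPUs outside its profile.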

Files

  • cortex/kernel_features/accelerator_limits.py (718 lines) - Complete implementation
  • tests/test_accelerator_limits.py (394 lines) - 32 unit tests
  • README_ACCELERATOR_LIMITS.md - Documentation

Acceptance Criteria

  • Profiles can be created with presets
  • Limits are correctly applied to cgroups
  • GPU environment variables are generated
  • Works in user mode with delegation
  • Unit tests pass (32/32)

Test Plan

```shell
pytest tests/test_accelerator_limits.py -v
# All 32 tests pass
```

Usage Examples

```shell
cortex limits create inference-job --preset inference --gpus 2
cortex limits apply inference-job --pid 12345
eval $(cortex limits env inference-job)
cortex limits status
```

Summary by CodeRabbit

  • New Features

    • Added accelerator-limits functionality to manage GPU, CPU, and memory resources.
    • Introduced workload presets: inference, training, batch, and interactive.
    • New CLI commands: create, apply, env, status, delete, and presets for resource profile management.
    • Support for environment variable generation and non-root user delegation.
  • Documentation

    • Added accelerator-limits documentation with quick-start guide and architecture overview.


Copilot AI review requested due to automatic review settings December 4, 2025 05:17

coderabbitai bot commented Dec 4, 2025

Walkthrough

Introduces accelerator-limits functionality for managing GPU, CPU, and memory resources for AI workloads via cgroups v2. Adds comprehensive resource management classes (AcceleratorLimitsManager, LimitsDatabase, CgroupsV2Controller, OOMScoreManager) with a CLI supporting profile creation, application to live processes, environment variable generation, status reporting, and preset-based configuration.

Changes

  • Documentation (README_ACCELERATOR_LIMITS.md): Documents the accelerator-limits feature including quick-start commands (create, apply, env, status, delete), environment variables, preset configurations (inference, training, batch, interactive), architecture components, related files, and usage guidelines.
  • Core Implementation (cortex/kernel_features/accelerator_limits.py): Introduces ResourceLimits dataclass with factory methods; LimitsDatabase for persistent profile storage with PID tracking; CgroupsV2Controller for cgroups v2 operations (creation, CPU/memory limits, PID management); OOMScoreManager for OOM score adjustments; AcceleratorLimitsManager as central orchestrator for create/apply/status/delete operations; comprehensive CLI entry point with subcommands and preset selection.
  • Tests (tests/test_accelerator_limits.py): Comprehensive unit test module covering ResourceLimits validation, LimitsDatabase CRUD operations, CgroupsV2Controller behavior, OOMScoreManager functionality, AcceleratorLimitsManager lifecycle, preset definitions, and CLI integration.
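
To make the walkthrough concrete, here is a minimal sketch of what a preset-driven ResourceLimits factory could look like. The field names and numeric defaults below are illustrative placeholders, not the values in the PR:

```python
# Hypothetical approximation of the preset factory described above.
# Field names and preset numbers are placeholders, not the PR's actual values.
from dataclasses import dataclass, field

PRESETS = {
    "inference":   {"cpu_quota": 400, "memory_max_gb": 16, "oom_score_adj": -500},
    "training":    {"cpu_quota": 800, "memory_max_gb": 64, "oom_score_adj": -200},
    "batch":       {"cpu_quota": 200, "memory_max_gb": 8,  "oom_score_adj": 200},
    "interactive": {"cpu_quota": 400, "memory_max_gb": 16, "oom_score_adj": 0},
}

@dataclass
class ResourceLimits:
    name: str
    cpu_quota: int = 0           # percent of one core; 400 == 4 cores
    memory_max_gb: int = 0
    oom_score_adj: int = 0
    gpu_ids: list = field(default_factory=list)

    @classmethod
    def from_preset(cls, name, preset, gpus=0):
        """Build limits from a named preset, assigning the first `gpus` devices."""
        return cls(name=name, gpu_ids=list(range(gpus)), **PRESETS[preset])

limits = ResourceLimits.from_preset("inference-job", "inference", gpus=2)
```

A preset keyed by workload type keeps the CLI surface small: one flag selects a coherent bundle of CPU, memory, and OOM settings.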

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as CLI<br/>(main)
    participant ALM as AcceleratorLimitsManager
    participant DB as LimitsDatabase
    participant Cgroups as CgroupsV2Controller
    participant OOM as OOMScoreManager
    participant SysFS as Linux cgroups v2

    User->>CLI: cortex limits apply profile-name --pid 12345
    CLI->>ALM: apply(profile_name, pid)
    ALM->>DB: get(profile_name)
    DB-->>ALM: ResourceLimits
    ALM->>Cgroups: create_cgroup(path)
    Cgroups->>SysFS: mkdir cgroup hierarchy
    SysFS-->>Cgroups: success
    Cgroups-->>ALM: true
    ALM->>Cgroups: apply_cpu_limits(path, quota, weight)
    Cgroups->>SysFS: write cpu.max, cpu.weight
    SysFS-->>Cgroups: success
    Cgroups-->>ALM: true
    ALM->>Cgroups: apply_memory_limits(path, max, high)
    Cgroups->>SysFS: write memory.max, memory.high
    SysFS-->>Cgroups: success
    Cgroups-->>ALM: true
    ALM->>Cgroups: add_pid(path, pid)
    Cgroups->>SysFS: write cgroup.procs
    SysFS-->>Cgroups: success
    Cgroups-->>ALM: true
    ALM->>OOM: set_oom_score_adj(pid, score)
    OOM->>SysFS: write /proc/[pid]/oom_score_adj
    SysFS-->>OOM: success
    OOM-->>ALM: true
    ALM->>DB: record_pid(pid, profile_name)
    DB-->>ALM: recorded
    ALM-->>CLI: true (success)
    CLI-->>User: Profile applied successfully

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • CgroupsV2Controller implementation: Verify system-level cgroups v2 operations (file I/O, error handling, permission delegation), path construction, and availability checks
  • LimitsDatabase schema and CRUD logic: Ensure database initialization, profile persistence, and PID tracking are correct and handle edge cases
  • AcceleratorLimitsManager orchestration: Validate component interactions, error propagation, and graceful fallbacks across create/apply/delete operations
  • CLI parameter validation: Review CLI argument parsing, preset selection, and override handling (CPU/memory/GPU/OOM options)
  • Integration points: Verify data flow between manager, database, cgroups controller, and OOM manager; check for race conditions or missing cleanup

Possibly related PRs

Suggested labels

enhancement, kernel-features

Poem

🐰 New limits bloom in cgroups deep,
GPU cores and memory to keep,
Presets dance from train to play,
Resources managed AI's way!
hops in approval

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

  • Title check ✅ Passed: The title clearly and specifically summarizes the main change: completing the accelerator-aware resource limits implementation, with a reference to the linked issue.
  • Description check ✅ Passed: The PR description is comprehensive, with summary, changes, files, acceptance criteria, test plan, and usage examples. It exceeds template requirements and provides clear context.
  • Linked Issues check ✅ Passed: All coding requirements from issue #222 are met: cgroups v2 integration, workload presets (inference/training/batch/interactive), GPU isolation via environment variables, OOM score adjustment, CPU controls, memory limits, user-mode delegation, and 32 passing unit tests.
  • Out of Scope Changes check ✅ Passed: All changes are directly aligned with issue #222 objectives: new accelerator_limits.py module, comprehensive tests, and documentation. No out-of-scope modifications detected.
  • Docstring Coverage ✅ Passed: Docstring coverage is 94.37%, which exceeds the required threshold of 80.00%.

@sonarqubecloud

sonarqubecloud bot commented Dec 4, 2025


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (7)
cortex/kernel_features/accelerator_limits.py (4)

29-37: Avoid configuring global logging at import time in a library module

Calling logging.basicConfig at module import can interfere with applications’ own logging configuration. Prefer configuring logging in the CLI entrypoint (or the top-level app) and keep this module to just creating a logger with getLogger(__name__).

```diff
-# Set up logging
-logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
-logger = logging.getLogger(__name__)
+# Set up module-level logger; configuration should be done by the caller/CLI.
+logger = logging.getLogger(__name__)
```

You can then configure logging inside main() if desired, e.g. logging.basicConfig(... ) there rather than at import time.


243-266: Tighten broad except Exception usage and improve logging

Several methods (create_cgroup, apply_cpu_limits, apply_memory_limits, add_pid, get_pids, and the OOMScoreManager methods) catch bare Exception, sometimes with pass, which can hide unexpected errors and complicate debugging.

  • Narrow the caught exceptions where feasible (OSError, IOError, ValueError, etc.).
  • Where you do intend to catch “anything” for robustness, log with logger.exception(...) instead of logger.error(...) or pass so stack traces are preserved.
    For example:
```diff
-        except Exception as e:
-            logger.error(f"Failed to set CPU limits: {e}")
+        except Exception:
+            logger.exception("Failed to set CPU limits")
```

And for try/except blocks that currently just pass, at least log at debug level so silent failures are observable when needed.

Also applies to: 284-300, 313-329, 336-344, 351-357, 386-394, 400-404
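
A self-contained sketch of the narrower pattern (the helper is hypothetical, not code from the PR): catch OSError, the realistic failure mode for cgroup file writes, and preserve the traceback with logger.exception.

```python
import logging

logger = logging.getLogger(__name__)

def write_cgroup_file(path: str, value: str) -> bool:
    """Write one cgroup control file; log with traceback and return False on failure."""
    try:
        with open(path, "w") as f:
            f.write(value)
    except OSError:
        # OSError covers FileNotFoundError, PermissionError, and EINVAL writes.
        logger.exception("Failed to write %s", path)
        return False
    return True
```

Returning a bool keeps the caller's control flow simple while the log record carries the full stack trace for debugging.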


467-471: Remove unnecessary f-string in log message

The print call here uses an f-string without any interpolation.

```diff
-        else:
-            print(f"Could not set OOM score (may need root)")
+        else:
+            print("Could not set OOM score (may need root)")
             success = False
```

This also addresses the Ruff/Sonar warning about an f-string with no replacement fields.


638-672: Optional: deduplicate repeated "Profile name" help text

The literal "Profile name" is repeated across multiple subparsers; Sonar flags this for duplication. Not critical, but you could factor it into a small constant to centralize future changes.

```python
PROFILE_NAME_HELP = "Profile name"

create_parser.add_argument("name", help=PROFILE_NAME_HELP)
apply_parser.add_argument("name", help=PROFILE_NAME_HELP)
delete_parser.add_argument("name", help=PROFILE_NAME_HELP)
env_parser.add_argument("name", help=PROFILE_NAME_HELP)
```

This is purely cosmetic and can be deferred if you prefer.

tests/test_accelerator_limits.py (2)

1-12: Consider removing the shebang from this test module

The file has a #!/usr/bin/env python3 shebang but is not typically executed directly (it’s run via pytest/unittest). This triggers EXE001 because the file isn’t executable and is unnecessary for a test module.

```diff
-#!/usr/bin/env python3
-"""
-Unit tests for Cortex Accelerator-Aware Resource Limits
-"""
+"""
+Unit tests for Cortex Accelerator-Aware Resource Limits
+"""
```

Keeping tests as plain modules avoids the linter warning and matches common practice.


212-223: Use _ for the unused path variable in cgroup creation test

In test_create_cgroup_mock, path is unpacked but never used, which Ruff flags as RUF059. Using _ clarifies that the second return value is intentionally ignored.

```diff
-            controller = CgroupsV2Controller(root)
-            success, path = controller.create_cgroup("cortex.slice/test-job")
+            controller = CgroupsV2Controller(root)
+            success, _ = controller.create_cgroup("cortex.slice/test-job")
```

This keeps the test behavior unchanged while addressing the linter warning.

README_ACCELERATOR_LIMITS.md (1)

124-135: Add a language to the architecture fenced code block

markdownlint flags this block for missing a language identifier. Since it’s ASCII art / diagram, text (or none) would be appropriate.

````diff
-```
+```text
 AcceleratorLimitsManager
 ├── LimitsDatabase (SQLite storage)
 ├── CgroupsV2Controller (cgroups interface)
 └── OOMScoreManager (OOM score adjustment)
@@
 ├── GPU: device IDs, percentage
 └── OOM: score adjustment
````

This keeps rendering unchanged while satisfying MD040.

📜 Review details

**Configuration used**: CodeRabbit UI
**Review profile**: CHILL
**Plan**: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da3e6355b164ac48826016b8745cbdb2a59df02b and e7ec868a2a3b2206c6e3d5af92fe60faa8713c97.

📒 Files selected for processing (3)

  • README_ACCELERATOR_LIMITS.md (1 hunks)
  • cortex/kernel_features/accelerator_limits.py (1 hunks)
  • tests/test_accelerator_limits.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1): tests/test_accelerator_limits.py references these symbols from cortex/kernel_features/accelerator_limits.py (24):

  • Classes: ResourceLimits (90-135), LimitsDatabase (138-213), CgroupsV2Controller (216-368), OOMScoreManager (371-405), AcceleratorLimitsManager (408-613), WorkloadPreset (39-44)
  • ResourceLimits methods: from_preset (113-126), to_dict (128-130), from_dict (133-135)
  • Database methods: save (166-173), get (175-183), list_all (192-196), delete (185-190), record_pid (198-204), get_pids (206-213)
  • Controller and manager methods: is_available (223-225), get_available_controllers (236-241), create_cgroup (243-266), get_pids (346-358), get_oom_score_adj (397-405), create (416-443), delete (587-600), get_env (478-507), main (616-714)

🪛 GitHub Check: SonarCloud Code Analysis

tests/test_accelerator_limits.py

  • [warning] 219: Replace the unused local variable "path" with "_". See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrny9ER6oUfBTHzwt2H&open=AZrny9ER6oUfBTHzwt2H&pullRequest=230

cortex/kernel_features/accelerator_limits.py

  • [warning] 470: Add replacement fields or use a normal string instead of an f-string. See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrny9A_6oUfBTHzwt2G&open=AZrny9A_6oUfBTHzwt2G&pullRequest=230
  • [failure] 243: Refactor this function to reduce its Cognitive Complexity from 17 to the 15 allowed. See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrny9A_6oUfBTHzwt2F&open=AZrny9A_6oUfBTHzwt2F&pullRequest=230
  • [failure] 639: Define a constant instead of duplicating this literal "Profile name" 4 times. See more on https://sonarcloud.io/project/issues?id=cortexlinux_cortex&issues=AZrny9A_6oUfBTHzwt2E&open=AZrny9A_6oUfBTHzwt2E&pullRequest=230

🪛 markdownlint-cli2 (0.18.1)

README_ACCELERATOR_LIMITS.md

  • 124: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.14.7)

tests/test_accelerator_limits.py

  • 1: Shebang is present but file is not executable (EXE001)
  • 219: Unpacked variable `path` is never used; prefix it with an underscore or another dummy variable pattern (RUF059)

cortex/kernel_features/accelerator_limits.py

  • 265, 298, 327, 342, 356, 392, 403: Do not catch blind exception: `Exception` (BLE001)
  • 294, 323, 366: Consider moving this statement to an `else` block (TRY300)
  • 299, 328, 343, 393: Use `logging.exception` instead of `logging.error` (TRY400)
  • 356-357, 403-404: `try`-`except`-`pass` detected, consider logging the exception (S110)
  • 470: f-string without any placeholders; remove extraneous `f` prefix (F541)

⏰ Context from checks skipped due to timeout of 90000ms (1). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).

  • GitHub Check: Agent

Comment on lines 104 to +111:

```python
def __post_init__(self):
    self.gpu_ids = self.gpu_ids or []

    if self.gpu_ids is None:
        self.gpu_ids = []
    if self.cpu_affinity is None:
        self.cpu_affinity = []
    if not self.cgroup_path:
        self.cgroup_path = f"{CORTEX_CGROUP}/{self.name}"
```


⚠️ Potential issue | 🟠 Major

Sanitize profile names before embedding them into cgroup paths

cgroup_path is built directly from name and later used as a relative path under /sys/fs/cgroup. If a caller passes characters like .. or path separators, this can escape the intended CORTEX_CGROUP subtree and potentially alter limits of unrelated cgroups when run as root.

```diff
     def __post_init__(self):
         if self.gpu_ids is None:
             self.gpu_ids = []
         if self.cpu_affinity is None:
             self.cpu_affinity = []
         if not self.cgroup_path:
-            self.cgroup_path = f"{CORTEX_CGROUP}/{self.name}"
+            # Sanitize profile name to avoid escaping the Cortex cgroup subtree.
+            safe_name = "".join(
+                c for c in self.name
+                if c.isalnum() or c in ("-", "_", ".")
+            )
+            if not safe_name:
+                raise ValueError(f"Invalid profile name {self.name!r}")
+            self.cgroup_path = f"{CORTEX_CGROUP}/{safe_name}"
```

This keeps existing tests (which use simple names) passing while preventing path traversal via the profile name.

🤖 Prompt for AI Agents
In cortex/kernel_features/accelerator_limits.py around lines 104-111, sanitize
self.name before embedding it into cgroup_path: compute a safe_name (e.g., use
os.path.basename(self.name) or strip all path separators and '..', then replace
any non-alphanumeric/_/- characters with a safe substitute or fallback to a
generated slug), ensure it has no leading '/' and no '..' segments, then build
cgroup_path using os.path.join(CORTEX_CGROUP, safe_name), normalize with
os.path.normpath, and finally assert or raise if the resulting path does not
start with CORTEX_CGROUP to prevent path traversal.
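
One possible implementation of that sanitization, following the normpath approach sketched in the prompt (illustrative; only the CORTEX_CGROUP value "cortex.slice" is taken from the module, the helper name is ours):

```python
import os

CORTEX_CGROUP = "cortex.slice"

def safe_cgroup_path(name: str) -> str:
    """Map a profile name to a cgroup path that cannot escape CORTEX_CGROUP."""
    # Drop any path components, then keep only clearly safe characters.
    base = os.path.basename(name.replace("\\", "/"))
    safe = "".join(c for c in base if c.isalnum() or c in "-_.")
    if not safe or set(safe) == {"."}:  # rejects "", ".", "..", "..."
        raise ValueError(f"Invalid profile name {name!r}")
    path = os.path.normpath(os.path.join(CORTEX_CGROUP, safe))
    # Belt-and-braces: the final path must stay inside the Cortex subtree.
    if not path.startswith(CORTEX_CGROUP + os.sep):
        raise ValueError(f"Profile name escapes cgroup subtree: {name!r}")
    return path
```

Simple names pass through unchanged, while traversal attempts are either neutralized or rejected.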

Comment on lines +500 to +506
```python
# CPU hints
if limits.cpu_quota > 0:
    cores = int(limits.cpu_quota / 100)
    env["OMP_NUM_THREADS"] = str(cores)
    env["MKL_NUM_THREADS"] = str(cores)
    env["OPENBLAS_NUM_THREADS"] = str(cores)
```


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against OMP_NUM_THREADS=0 when cpu_quota < 100

cores = int(limits.cpu_quota / 100) will produce 0 if a user supplies a cpu_quota lower than 100%, leading to OMP_NUM_THREADS=0 and similar values, which are invalid for most runtimes.

```diff
-        if limits.cpu_quota > 0:
-            cores = int(limits.cpu_quota / 100)
+        if limits.cpu_quota > 0:
+            # Ensure at least 1 thread even for sub-100% quotas.
+            cores = max(1, int(limits.cpu_quota / 100))
             env["OMP_NUM_THREADS"] = str(cores)
             env["MKL_NUM_THREADS"] = str(cores)
             env["OPENBLAS_NUM_THREADS"] = str(cores)
```

This keeps current behavior for all presets (multiples of 100) while making custom low-CPU profiles safe.

🤖 Prompt for AI Agents
In cortex/kernel_features/accelerator_limits.py around lines 500 to 506, the
calculation cores = int(limits.cpu_quota / 100) can produce 0 for cpu_quota <
100 causing invalid OMP_NUM_THREADS=0; change the calculation to ensure at least
1 thread by using cores = max(1, int(limits.cpu_quota / 100)) (keeping the
existing if limits.cpu_quota > 0 guard) so presets (multiples of 100) behave the
same while low custom cpu_quota values result in a safe minimum of 1.
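
The guard from the suggestion, shown as a standalone helper for clarity (the helper name is ours, not the PR's):

```python
def omp_threads(cpu_quota: float) -> int:
    """Thread-count hint from a CPU quota percentage (400 means four cores)."""
    return max(1, int(cpu_quota / 100))

print(omp_threads(50))  # sub-100% quota still yields one thread, not zero
```

Quotas that are exact multiples of 100 behave as before; only the previously broken sub-100% case changes.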


Copilot AI left a comment


Pull request overview

This PR implements a comprehensive accelerator-aware resource limits system for AI workloads using cgroups v2. It introduces a Python-based wrapper that allows users to create and manage resource profiles with predefined workload presets (inference, training, batch, interactive) and apply them to processes. The implementation includes CPU quota/weight controls, memory limits (hard and soft), GPU device isolation via environment variables for multiple vendors (NVIDIA, AMD, Intel), OOM score adjustments, and SQLite-based profile storage with a full CLI interface.

Key Changes:

  • Complete cgroups v2 controller implementation with unified hierarchy support and user mode delegation
  • Four workload presets optimized for different AI use cases with sensible CPU, memory, GPU, and OOM defaults
  • Multi-vendor GPU isolation supporting NVIDIA CUDA, AMD ROCm, and Intel oneAPI through environment variables

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 17 comments.

  • cortex/kernel_features/accelerator_limits.py: Complete implementation with ResourceLimits dataclass, LimitsDatabase (SQLite), CgroupsV2Controller, OOMScoreManager, AcceleratorLimitsManager, and CLI (718 lines)
  • tests/test_accelerator_limits.py: Comprehensive unit tests covering all major components with 32 test cases including preset validation, database operations, cgroups mocking, and CLI testing (394 lines)
  • README_ACCELERATOR_LIMITS.md: Complete documentation with quick start guide, preset descriptions, command reference, environment variables, user mode setup, architecture overview, and testing instructions
Comments suppressed due to low confidence (1)

cortex/kernel_features/accelerator_limits.py:22

  • Import of 'shutil' is not used.
import shutil


```python
import logging

# Configuration
CORTEX_DB = Path.home() / ".cortex/limits.db"
```

Copilot AI Dec 4, 2025


The database path uses Path.home() which resolves to the home directory of the user running the process. If the code is run with sudo or as a different user, this could lead to inconsistent database locations (e.g., /root/.cortex/limits.db vs /home/user/.cortex/limits.db). This could cause profiles created by one user to be invisible to another user, even when they have the necessary permissions to manage cgroups.

Consider documenting this behavior or using a consistent system-wide location (e.g., /var/lib/cortex/limits.db) when running as root, with fallback to user home directory for non-root users.

Suggested change:

```diff
-CORTEX_DB = Path.home() / ".cortex/limits.db"
+def get_cortex_db_path():
+    if hasattr(os, "geteuid") and os.geteuid() == 0:
+        db_path = Path("/var/lib/cortex/limits.db")
+    else:
+        db_path = Path.home() / ".cortex/limits.db"
+    db_path.parent.mkdir(parents=True, exist_ok=True)
+    return db_path
+
+CORTEX_DB = get_cortex_db_path()
```

```text
Features:
- cgroups v2 unified hierarchy support
- Workload presets with sensible defaults
- NVIDIA GPU isolation via environment variables
```

Copilot AI Dec 4, 2025


The documentation in the docstring at line 11 states "NVIDIA GPU isolation via environment variables", but the implementation also supports AMD ROCm (HIP_VISIBLE_DEVICES) and Intel oneAPI (ONEAPI_DEVICE_SELECTOR) as seen in lines 488-492. The docstring should be updated to reflect this multi-vendor support.

Suggested change:

```diff
-- NVIDIA GPU isolation via environment variables
+- GPU isolation via CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES, ONEAPI_DEVICE_SELECTOR
```

Comment on lines +516 to +517
```python
for key, value in env.items():
    print(f"export {key}={value}")
```

Copilot AI Dec 4, 2025


The environment variable values are printed without any shell escaping, which could lead to shell injection vulnerabilities if a malicious profile name or GPU IDs are stored in the database. While the values are currently generated from integers and controlled data, the profile name comes from user input and is displayed in error messages.

Consider using shlex.quote() to properly escape the values, especially since this output is intended to be evaluated with eval $(cortex limits env ...):

```python
import shlex
# ...
for key, value in env.items():
    print(f"export {key}={shlex.quote(value)}")
```

Additionally, the profile_name used in the error message should be escaped or validated.
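
A self-contained sketch of the quoting fix (the env dict below is a hypothetical stand-in for a profile's generated variables, including a deliberately hostile value):

```python
import shlex

env = {
    "CUDA_VISIBLE_DEVICES": "0,1",
    "NOTE": "value with spaces; rm -rf /",  # deliberately hostile value
}
# shlex.quote leaves simple values untouched and single-quotes anything
# the shell could misinterpret, so eval $(...) stays safe.
lines = [f"export {key}={shlex.quote(value)}" for key, value in env.items()]
print("\n".join(lines))
```

The simple device list passes through unquoted, while the hostile value is wrapped in single quotes and can no longer be executed by the shell.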

```python
CORTEX_CGROUP = "cortex.slice"

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
```

Copilot AI Dec 4, 2025


The logging.basicConfig() call at module level can interfere with the logging configuration of other modules or the main application. This is because basicConfig() only works if the root logger hasn't been configured yet, and calling it in a library module is generally discouraged.

Consider removing this line and letting the application configure logging, or check if the root logger is already configured before calling basicConfig():

```python
if not logging.getLogger().hasHandlers():
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
```

Suggested change:

```diff
-logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
```

Comment on lines +684 to +685:

```python
if args.command == "create":
    limits = ResourceLimits.from_preset(args.name, args.preset, args.gpus)
```

Copilot AI Dec 4, 2025


The name parameter from user input is used directly in f-strings for cgroup paths and database operations without validation. This could potentially lead to path traversal or SQL injection issues (though SQLite with parameterized queries provides some protection).

Consider adding validation to ensure the profile name only contains safe characters:

```python
import re

def validate_profile_name(name: str) -> bool:
    """Validate profile name contains only safe characters."""
    return bool(re.match(r'^[a-zA-Z0-9_-]+$', name))
```

This should be called in the create() method and other places where profile names are accepted from user input.

```python
import os
import sys
import json
```

Copilot AI Dec 4, 2025


Import of 'json' is not used.

Suggested change:

```diff
-import json
```

```python
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch, MagicMock
```

Copilot AI Dec 4, 2025


Import of 'MagicMock' is not used.

Suggested change:

```diff
-from unittest.mock import patch, MagicMock
+from unittest.mock import patch
```

```python
try:
    subtree_control.write_text(f"+{ctrl}")
except PermissionError:
    pass
```

Copilot AI Dec 4, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```diff
-    pass
+    logger.warning(f"Permission denied enabling controller '{ctrl}' in {subtree_control}")
```

Comment on lines +356 to +357:

```python
except Exception:
    pass
```

Copilot AI Dec 4, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```diff
-except Exception:
-    pass
+except Exception as e:
+    logger.error(f"Failed to read PIDs from {procs_file}: {e}")
```

Comment on lines +403 to +404:

```python
except Exception:
    pass
```

Copilot AI Dec 4, 2025


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

```diff
-except Exception:
-    pass
+except Exception as e:
+    logger.error(f"Failed to get OOM score for PID {pid}: {e}")
```

@mikejmorgan-ai
Member

Closing as duplicate - Issue #222 was already implemented in commit 7c2cb79. Thanks for the contribution!


Development

Successfully merging this pull request may close these issues.

[Kernel Feature] Accelerator-Aware Resource Limits - cgroups v2 Wrapper for AI
