Skip to content

test: add GPU detection tests with dependency injection#142

Closed
brianwtaylor wants to merge 9 commits intoNVIDIA:mainfrom
brianwtaylor:test/gpu-detection-spark-paths
Closed

test: add GPU detection tests with dependency injection#142
brianwtaylor wants to merge 9 commits intoNVIDIA:mainfrom
brianwtaylor:test/gpu-detection-spark-paths

Conversation

@brianwtaylor
Copy link
Copy Markdown
Contributor

@brianwtaylor brianwtaylor commented Mar 17, 2026

Summary

  • Add optional opts parameter to detectGpu() for dependency injection of runCapture and process.platform — existing callers are unaffected (backward compatible)
  • Enrich GPU detection output: surface GPU name for NVIDIA, detect DGX Spark GB10 unified memory
  • Add suggestModelsForGpu() to rank NIM models by VRAM fit and mark the optimal model as recommended
  • Harden nim.js: validate sandbox names in containerName() to prevent command injection, replace process.exit(1) with thrown errors, accept configurable port in nimStatus()
  • Add port 8000 preflight check before starting local NIM container
  • Add per-request timeout to NIM health probe to prevent hung curl from defeating the polling loop

GPU detection previously had zero test coverage. Any regression in detectGpu() silently breaks onboarding for entire hardware classes (DGX Spark, Apple Silicon, multi-GPU) with no CI signal.

Test plan

Automated Tests

node --test test/nim.test.js

28 tests covering GPU detection (8 injected paths: standard NVIDIA, multi-GPU, DGX Spark GB10, Apple Silicon discrete/unified, no-GPU), model recommendation (7 VRAM-based ranking tests), container name sanitization (3 tests), and NIM lifecycle (pull, health, status).

@brianwtaylor brianwtaylor force-pushed the test/gpu-detection-spark-paths branch from b949de7 to 0ee1b3f Compare March 18, 2026 00:08
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds dependency injection to GPU detection (detectGpu(opts)), enriches detection output, and centralizes CLI invocation through an injectable runCmd. Also introduces model/image helpers and container lifecycle functions (pull/start/stop/wait/status) in bin/lib/nim.js. Tests added to validate injected GPU detection scenarios.

Changes

Cohort / File(s) Summary
Nim library (GPU, images, containers)
bin/lib/nim.js
Refactors GPU detection to detectGpu(opts) with injectable runCmd and platform; richer detection result (type, count, totalMemoryMB, perGpuMB, nimCapable, optional spark/name/cores); adds containerName, getImageForModel, listModels, pullNimImage, startNimContainer, waitForNimHealth, stopNimContainer, nimStatus; routes CLI calls through injected runner; minor data-shape safety tweaks and docs.
Tests (injected GPU detection)
test/nim.test.js
Adds "detectGpu (injected)" suite and mockRunCapture to simulate CLI outputs across many scenarios (NVIDIA single/multi, DGX Spark GB10, Spark failures, macOS discrete/Apple Silicon, no-GPU, non-GB10 NVIDIA).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Test as "Client / Test"
    participant Detect as "detectGpu(opts)"
    participant Runner as "runCmd (injected)"
    participant System as "OS / External CLI"

    Test->>Detect: call detectGpu({ runCmd, platform })
    Detect->>Runner: request command (e.g., nvidia-smi, free, sysctl, system_profiler)
    Runner->>System: execute shell command
    System-->>Runner: command stdout/stderr or error
    Runner-->>Detect: return captured output / error
    Detect->>Detect: parse outputs, apply fallbacks, determine GPU type/size/cores
    Detect-->>Test: return detection object (or null)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hop through logs and capture shells,
Inject my runner, count the spells,
From CUDA fans to Apple cores,
I wag my whiskers, check the scores,
A tiny drum — containers roar! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'test: add GPU detection tests with dependency injection' accurately describes the main change—adding tests for GPU detection with dependency injection support.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/nim.js`:
- Around line 26-33: The JSDoc promises cores?: number but the detector can
return cores: null; update the runtime in the GPU detection function (the
"Detect GPU hardware" function in bin/lib/nim.js that builds the returned
object) so it omits the cores property entirely when the value is unknown
(null/undefined), preserving the optional-number contract, and keep the JSDoc as
cores?: number; ensure the returned object only includes cores when it's a valid
number.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bd571467-8df9-4090-91e7-54c2240aed23

📥 Commits

Reviewing files that changed from the base of the PR and between 25ddffd and 02f5f5f.

📒 Files selected for processing (1)
  • bin/lib/nim.js

Comment thread bin/lib/nim.js
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
bin/lib/nim.js (2)

132-137: ⚠️ Potential issue | 🟠 Major

Avoid process.exit(1) inside library functions.

Line 136 and Line 149 terminate the entire process from helper functions, which makes these APIs unsafe for reuse and testing. Throw an error and let the CLI layer decide whether to exit.

🛠️ Suggested fix
 function pullNimImage(model) {
   const image = getImageForModel(model);
   if (!image) {
-    console.error(`  Unknown model: ${model}`);
-    process.exit(1);
+    throw new Error(`Unknown model: ${model}`);
   }
   console.log(`  Pulling NIM image: ${image}`);
   run(`docker pull ${image}`);
   return image;
 }

 function startNimContainer(sandboxName, model, port = 8000) {
   const name = containerName(sandboxName);
   const image = getImageForModel(model);
   if (!image) {
-    console.error(`  Unknown model: ${model}`);
-    process.exit(1);
+    throw new Error(`Unknown model: ${model}`);
   }

Also applies to: 147-150

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/nim.js` around lines 132 - 137, The helper functions (e.g.,
pullNimImage) currently call process.exit(1) when getImageForModel(model)
returns falsy (and similarly at the other spot around lines 147–150); replace
those process.exit calls with throwing a descriptive Error (e.g., throw new
Error(`Unknown model: ${model}`)) so callers can handle termination at the CLI
layer instead of library code, update any other helper functions that use
process.exit to throw instead, and ensure the CLI entrypoint catches these
Errors and exits with the appropriate status.

193-210: ⚠️ Potential issue | 🟠 Major

nimStatus ignores configured port and can report false unhealthy state.

startNimContainer() allows custom ports, but Line 205 always checks localhost:8000. For non-default ports this returns incorrect health.

✅ Suggested fix
-/** `@param` {string} sandboxName `@returns` {{running: boolean, healthy?: boolean, container: string, state?: string}} */
-function nimStatus(sandboxName) {
+/** `@param` {string} sandboxName `@param` {number} [port=8000] `@returns` {{running: boolean, healthy?: boolean, container: string, state?: string}} */
+function nimStatus(sandboxName, port = 8000) {
   const name = containerName(sandboxName);
   try {
@@
     let healthy = false;
     if (state === "running") {
-      const health = runCapture(`curl -sf http://localhost:8000/v1/models 2>/dev/null`, {
+      const health = runCapture(`curl -sf http://localhost:${port}/v1/models 2>/dev/null`, {
         ignoreError: true,
       });
       healthy = !!health;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/nim.js` around lines 193 - 210, nimStatus currently hardcodes
localhost:8000 for the health check which breaks setups where startNimContainer
starts the container on a custom port; update nimStatus to determine the actual
port used for that container (either by reading the same configuration/env used
by startNimContainer or by querying the container inspect output for the
HostPort) and use that port in the curl health check URL instead of :8000; keep
the existing ignoreError behavior and return fields (running, healthy,
container, state) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/nim.js`:
- Around line 10-11: containerName currently returns an interpolated string from
unvalidated sandboxName which can lead to command injection when used in shell
commands (the places where containerName() is later interpolated into commands);
modify containerName to validate and sanitize sandboxName before returning (for
example: enforce a strict allowlist regex like /^[a-z0-9_-]+$/i, throw or reject
inputs that don't match, and optionally normalize or escape input), or perform
this validation at the call sites that build shell commands (where
containerName() is used) to ensure only safe characters are allowed; update any
callers to handle validation errors gracefully.

---

Outside diff comments:
In `@bin/lib/nim.js`:
- Around line 132-137: The helper functions (e.g., pullNimImage) currently call
process.exit(1) when getImageForModel(model) returns falsy (and similarly at the
other spot around lines 147–150); replace those process.exit calls with throwing
a descriptive Error (e.g., throw new Error(`Unknown model: ${model}`)) so
callers can handle termination at the CLI layer instead of library code, update
any other helper functions that use process.exit to throw instead, and ensure
the CLI entrypoint catches these Errors and exits with the appropriate status.
- Around line 193-210: nimStatus currently hardcodes localhost:8000 for the
health check which breaks setups where startNimContainer starts the container on
a custom port; update nimStatus to determine the actual port used for that
container (either by reading the same configuration/env used by
startNimContainer or by querying the container inspect output for the HostPort)
and use that port in the curl health check URL instead of :8000; keep the
existing ignoreError behavior and return fields (running, healthy, container,
state) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0c8872b6-d201-4dbc-97c1-5901a45b55d8

📥 Commits

Reviewing files that changed from the base of the PR and between dc6f316 and cd21224.

📒 Files selected for processing (1)
  • bin/lib/nim.js

Comment thread bin/lib/nim.js
Add dependency injection to detectGpu() via an optional opts parameter,
enabling deterministic tests for all 4 code paths: standard NVIDIA,
DGX Spark GB10 unified memory, Apple Silicon, and no-GPU fallback.

Signed-off-by: Brian Taylor <brian@briantaylor.xyz>
Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
- Add test for the sysctl hw.memsize fallback when system_profiler
  reports no VRAM (the actual Apple Silicon code path)
- Rename existing Apple test to clarify it covers the discrete VRAM
  parsing branch
- Use more specific mock pattern "query-gpu=name" to avoid substring
  collisions
Aligns runtime behavior with JSDoc contract (cores?: number).
When system_profiler does not report core count, the property is
now omitted entirely rather than set to null.
Adds suggestModelsForGpu() that ranks NIM models by VRAM fit and
marks the optimal model as recommended. Also surfaces GPU name
in NVIDIA detection for better display during onboarding.
- Validate sandboxName in containerName() to prevent command injection
  via shell metacharacters (regex: /^[a-zA-Z0-9][a-zA-Z0-9_.-]*$/)
- Replace process.exit(1) with throw in pullNimImage and
  startNimContainer so callers can handle errors gracefully
- Accept port parameter in nimStatus() instead of hardcoding 8000
- Add tests for sandbox name validation and throw behavior
Verify the NIM port is available before pulling the image and starting
the container. Without this check, a port conflict (e.g. an existing
vLLM instance) causes an opaque docker failure instead of a clear
onboarding error with the blocking process identified.
Without a timeout, a single curl invocation can stall indefinitely on
a localhost socket, defeating the outer polling loop's timeout. Add
curl --max-time 10 and execSync timeout: 15000 as a belt-and-suspenders
bound on each probe iteration.
@brianwtaylor brianwtaylor force-pushed the test/gpu-detection-spark-paths branch from 413b25a to 8f18ff9 Compare March 19, 2026 01:12
@brianwtaylor brianwtaylor closed this by deleting the head repository Mar 20, 2026
@wscurran wscurran added the enhancement: testing Use this label to identify requests to improve NemoClaw test coverage. label Mar 20, 2026
@wscurran
Copy link
Copy Markdown
Contributor

Thanks for adding optional dependency injection to the GPU detection tests, this will help ensure our tests are more robust and reliable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement: testing Use this label to identify requests to improve NemoClaw test coverage.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants