fix(gpu): detect Jetson Thor and Orin unified memory GPUs#308
fix(gpu): detect Jetson Thor and Orin unified memory GPUs#308kagura-agent wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds and exports a frozen Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Node CLI
participant lib as nim.detectGpu()
participant nvsmi as nvidia-smi
participant OS as System RAM
CLI->>lib: invoke detectGpu()
lib->>nvsmi: query name & memory
alt nvidia-smi returns usable memory
nvsmi-->>lib: memory, name
lib-->>CLI: GPU descriptor (discrete)
else nvidia-smi memory is [N/A]
nvsmi-->>lib: name only
lib->>lib: match name against UNIFIED_MEMORY_CHIPS
alt match found
lib->>OS: read total system RAM
OS-->>lib: totalMemoryMB
lib-->>CLI: GPU descriptor (unifiedMemory: true, spark if gb10)
else no match
lib-->>CLI: assume no GPU (CPU-only)
end
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/nim.js (1)
59-59: Consider case-insensitive matching for robustness.The current substring check is case-sensitive. While
nvidia-smitypically returns consistent casing, a case-insensitive match would be more defensive against unexpected output variations.♻️ Suggested change for case-insensitive matching
- if (nameOutput && UNIFIED_MEMORY_CHIPS.some((chip) => nameOutput.includes(chip))) { + if (nameOutput && UNIFIED_MEMORY_CHIPS.some((chip) => nameOutput.toLowerCase().includes(chip.toLowerCase()))) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/nim.js` at line 59, The substring check against UNIFIED_MEMORY_CHIPS is currently case-sensitive; update the condition that uses nameOutput and UNIFIED_MEMORY_CHIPS (the if with nameOutput && UNIFIED_MEMORY_CHIPS.some(...)) to perform a case-insensitive comparison by normalizing both sides (e.g., call toLowerCase() on nameOutput and on each chip string or use a case-insensitive regex) before calling includes/some, and keep the existing null/undefined guard for nameOutput.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/nim.js`:
- Line 59: The substring check against UNIFIED_MEMORY_CHIPS is currently
case-sensitive; update the condition that uses nameOutput and
UNIFIED_MEMORY_CHIPS (the if with nameOutput && UNIFIED_MEMORY_CHIPS.some(...))
to perform a case-insensitive comparison by normalizing both sides (e.g., call
toLowerCase() on nameOutput and on each chip string or use a case-insensitive
regex) before calling includes/some, and keep the existing null/undefined guard
for nameOutput.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4943f10e-5dba-4300-8e12-5ca21b084ca4
📒 Files selected for processing (2)
bin/lib/nim.jstest/nim-jetson.test.js
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/nim.js`:
- Line 71: The spark flag uses case-sensitive matching
(nameOutput.includes("GB10")) which is inconsistent with the earlier
case-insensitive chip detection that uses toLowerCase(); change spark to perform
a case-insensitive check (e.g., use nameOutput.toLowerCase().includes("gb10"))
so both unified memory detection and the spark flag behave consistently; update
the expression that sets spark accordingly (reference: the spark assignment and
the earlier toLowerCase() usage).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c29191fa-b7e3-4bfe-bfd5-b76d1f87465b
📒 Files selected for processing (1)
bin/lib/nim.js
|
Addressed both CodeRabbit review comments in 89e5cc6:
All 54 tests pass. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/lib/nim.js (1)
26-27: Consider freezing the exported array to prevent accidental mutation.Since
UNIFIED_MEMORY_CHIPSis exported and could be used by external consumers, making it immutable prevents accidental modifications that could affect detection logic.♻️ Optional: freeze the constant
// Chip names that use unified memory (VRAM not separately queryable) -const UNIFIED_MEMORY_CHIPS = ["GB10", "Thor", "Orin", "Xavier"]; +const UNIFIED_MEMORY_CHIPS = Object.freeze(["GB10", "Thor", "Orin", "Xavier"]);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/nim.js` around lines 26 - 27, UNIFIED_MEMORY_CHIPS is a mutable exported array which could be accidentally modified; make it immutable by freezing it where it's declared (e.g., replace the raw array with Object.freeze([...]) so UNIFIED_MEMORY_CHIPS is assigned to a frozen array) to prevent runtime mutations that would break chip detection logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/lib/nim.js`:
- Around line 26-27: UNIFIED_MEMORY_CHIPS is a mutable exported array which
could be accidentally modified; make it immutable by freezing it where it's
declared (e.g., replace the raw array with Object.freeze([...]) so
UNIFIED_MEMORY_CHIPS is assigned to a frozen array) to prevent runtime mutations
that would break chip detection logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5e3b7ec-584c-4caf-89d7-70256f3df3af
📒 Files selected for processing (2)
bin/lib/nim.jstest/nim-jetson.test.js
✅ Files skipped from review due to trivial changes (1)
- test/nim-jetson.test.js
Closes NVIDIA#300 The unified-memory fallback in detectGpu() only checked for "GB10" (DGX Spark). Jetson Thor and Orin report different chip names, causing GPU detection to fail and fallback to cloud inference. Add Thor, Orin, and Xavier to the unified-memory chip name list so all current Jetson and DGX Spark platforms are detected correctly.
…tion Address CodeRabbit review: nvidia-smi output casing may vary, so normalize both sides with toLowerCase() for robustness.
Address CodeRabbit review comments: - Use toLowerCase() for spark flag (consistent with chip detection) - Update test to verify case-insensitive GB10 matching
89e5cc6 to
752f1af
Compare
|
Re: latest CodeRabbit review — the |
|
Closing to reduce open PR count — I had too many PRs open, which adds review burden rather than helping. Happy to resubmit if this fix is still wanted. |
* improve docs * save
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes #1191 Fixes #1160 Fixes #1182 Fixes #1162 Related #991 ## Notes - `#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - #1106 - #308 - #95 - #770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes NVIDIA#1191 Fixes NVIDIA#1160 Fixes NVIDIA#1182 Fixes NVIDIA#1162 Related NVIDIA#991 ## Notes - `NVIDIA#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - NVIDIA#1106 - NVIDIA#308 - NVIDIA#95 - NVIDIA#770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - stop requiring `NVIDIA_API_KEY` for local-only `nemoclaw start` and only gate the Telegram bridge when that bridge actually needs the key - clean up the dashboard forward, `nemoclaw` gateway, and `openshell-cluster-nemoclaw` Docker volumes when the last sandbox is destroyed - broaden unified-memory NVIDIA GPU detection beyond `GB10` while keeping `spark: true` specific to GB10 - harden policy merge/retry behavior so truncated or error-like current-policy reads rebuild from a clean `version: 1` scaffold instead of producing malformed YAML ## Issue Mapping Fixes NVIDIA#1191 Fixes NVIDIA#1160 Fixes NVIDIA#1182 Fixes NVIDIA#1162 Related NVIDIA#991 ## Notes - `NVIDIA#1188` was investigated but is not included in this PR. - The current evidence still points to a deeper runtime / proxy reachability problem on macOS + Colima rather than a bounded NemoClaw-only fix. - Keeping it out of this branch avoids speculative networking changes without strong reproduction and cross-platform coverage. ## Validation ```bash npx vitest run npx eslint bin/nemoclaw.js bin/lib/nim.js bin/lib/policies.js test/cli.test.js test/nim.test.js test/policies.test.js test/service-env.test.js npx tsc -p jsconfig.json --noEmit ``` ## References Reviewed - NVIDIA#1106 - NVIDIA#308 - NVIDIA#95 - NVIDIA#770 Signed-off-by: Kevin Jones <kejones@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** - Core services can start without an NVIDIA API key. - Enhanced unified‑memory GPU detection with more accurate capability reporting. * **Bug Fixes** - Gateway and forwarded‑port cleanup only runs when the last sandbox is removed and no live sandboxes remain. - Telegram bridge now starts only when both required tokens are present; clearer startup warnings. - Policy parsing/merge more robust for metadata‑only or malformed inputs; consistent version header formatting. * **Tests** - Added tests covering GPU detection, policy parsing/merge, CLI sandbox/gateway flows, and service startup. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Closes #300
The unified-memory fallback in
detectGpu()only checked for"GB10"(DGX Spark). Jetson Thor and Orin report different chip names vianvidia-smi --query-gpu=name, causing GPU detection to fail and fallback to cloud inference.Changes
bin/lib/nim.jsUNIFIED_MEMORY_CHIPS = ["GB10", "Thor", "Orin", "Xavier"]constant.some(chip => nameOutput.includes(chip))instead of.includes("GB10")unifiedMemory: trueandnameproperties to the return objectsparkproperty true only for GB10 (backward compat)test/nim-jetson.test.js(new)sparkflag logicTesting
All 54 tests pass (43 existing + 11 new):
Notes
detectGpuonly exists inbin/lib/nim.jsSummary by CodeRabbit
New Features
Tests