fix(onboard): three NVIDIA Endpoints model probe bugs (#1601) #1602
cv merged 2 commits into NVIDIA:main from
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough

Added three validation helpers and tests; tightened per-probe curl timeouts; optionally skipped the Responses API probe for `nvidia-prod`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Wizard as Onboard Wizard
    participant Onboard as onboard.js validator
    participant Curl as curl probe
    participant Build as NVIDIA Build API
    Wizard->>Onboard: validate provider & model (includes shouldSkipResponsesProbe)
    alt Responses probed
        Onboard->>Curl: POST /v1/responses (with curl args)
        Curl->>Build: /v1/responses
        Build-->>Curl: 404 or response
        Curl-->>Onboard: probe result (raw body preserved)
    end
    Onboard->>Curl: POST /v1/chat/completions (with --connect-timeout 10 --max-time 15)
    Curl->>Build: /v1/chat/completions
    Build-->>Curl: success, timeout, or "Function 'X': Not found for account 'Y'"
    Curl-->>Onboard: probe result (raw body preserved)
    Onboard->>Onboard: if NVCF pattern detected -> nvcfFunctionNotFoundMessage(model)
    Onboard-->>Wizard: validation outcome (success or reframed NVCF message)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (2)

src/lib/validation.ts (1)

107-112: Make the rewritten message model-classifiable. If this string is fed back through `classifyValidationFailure()`, it won't hit the model regex on lines 38-39, so the recovery path degrades to `unknown`/`selection` even though the guidance says to pick a different model. Either extend the classifier for "not deployed for your account", or phrase this in a way the existing classifier already recognizes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `src/lib/validation.ts` around lines 107-112: the message returned by `nvcfFunctionNotFoundMessage` should be rewritten so it matches the existing `classifyValidationFailure` model regex (so the classifier routes to the model-specific recovery path). Update the string to include the classifier-recognized phrasing (e.g. "Model '<model>' not found" or "Model '<model>' not available") and also mention "not deployed for your account", so `classifyValidationFailure` can detect the model name and the not-found/not-available condition. Modify `nvcfFunctionNotFoundMessage` accordingly (or alternatively add the "not deployed for your account" pattern to `classifyValidationFailure`), and keep the two consistent.

bin/lib/onboard.js (1)

1133-1138: Match the NVCF signal before collapsing failures to `message`. This branch only checks `failure.message`. If `runCurlProbe()` ever stops carrying the raw NVCF `detail` through that field, the special case silently stops working even though `result.body` still contains the marker. Matching against the raw body before pushing into `failures` would make this path much less brittle.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `bin/lib/onboard.js` around lines 1133-1138: the current detection logic uses `failures.find(...)` and only checks `failure.message` with `isNvcfFunctionNotFoundForAccount`, which is brittle if `runCurlProbe` moves the raw NVCF `detail` out of `message`. Inspect the raw probe response before failures are collapsed — either (A) preserve the raw response body/detail when building the failures array in the `runCurlProbe` result construction (e.g. attach `result.body` or `result.detail` to the failure object), or (B) change the `accountFailure` search to call `isNvcfFunctionNotFoundForAccount` against `failure.message || failure.body || failure.detail` (or directly against the original `result.body` if available), so the raw NVCF marker is matched reliably.
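Option (B) from that nitpick can be sketched as below. This is illustrative only: the regex mirrors the NVCF error quoted in this PR, and the failure-object fields (`message`, `body`, `detail`) are assumptions about the shape `runCurlProbe()` produces, not the merged code.

```javascript
// NVCF error marker, matching the body quoted in this PR:
// {"detail":"Function '<uuid>': Not found for account '<id>'"}
const NVCF_NOT_FOUND_RE = /Function '[^']+': Not found for account '[^']+'/;

function isNvcfFunctionNotFoundForAccount(text) {
  return typeof text === "string" && NVCF_NOT_FOUND_RE.test(text);
}

// Check the summarized message first, then fall back to the raw fields,
// so the special case survives changes to how probe errors are summarized.
function findNvcfAccountFailure(failures) {
  return failures.find(
    (f) =>
      isNvcfFunctionNotFoundForAccount(f.message) ||
      isNvcfFunctionNotFoundForAccount(f.body) ||
      isNvcfFunctionNotFoundForAccount(f.detail)
  );
}
```

With this, the detector keeps working even when only `body` carries the raw NVCF detail.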
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e782ea14-5884-45f0-bb4f-af81d0fa3b0c
📒 Files selected for processing (3)
- bin/lib/onboard.js
- src/lib/validation.test.ts
- src/lib/validation.ts
cv left a comment
LGTM — security review PASS.

- Credential handling unchanged, keys still routed through `normalizeCredentialValue()`
- NVCF error detection regex is restrictive and well-tested
- Model IDs validated upstream by `isSafeModelId()`, no injection risk
- Timeout reduction from 60s to 15s is a security positive
- Raw `body` in failures array is internal-only, never surfaced to user
- 6 new unit tests cover positive/negative matches and classification routing

No concerns.
Several v0.0.10 PRs just merged, including changes to
NVIDIA Build (https://integrate.api.nvidia.com/v1) has three quirks that combine to make custom-model selection in NemoClaw onboarding look broken even when the model would otherwise work. The default Nemotron path is unaffected because it picks models that happen to mask all three behaviors.

1. Skip the Responses API probe entirely for nvidia-prod. NVIDIA Build does not expose /v1/responses for any model — every probe to that path returns '404 page not found'. The probe loop silently falls through to chat completions on success, so working models still onboard, but the noisy 404 leaks into error messages when chat completions also fails. Add shouldSkipResponsesProbe() in src/lib/validation.ts and gate the probe list in probeOpenAiLikeEndpoint on it.

2. Detect NVCF 'Function not found for account' and reframe. Many catalog models (e.g. mistralai/mistral-large) return:

   {"detail":"Function '<uuid>': Not found for account '<id>'"}

   This means the model is in the public catalog but is not deployed for the user's account/org. The raw NVCF body is opaque; surface a clear, actionable message instead. Add isNvcfFunctionNotFoundForAccount and nvcfFunctionNotFoundMessage helpers in validation.ts and use them when collecting probe failures.

3. Tighten validation probe timeout from 60s to 15s. getCurlTimingArgs() defaults to --max-time 60. That is fine for inference calls, but a hung NVIDIA backend (confirmed live on mistralai/mistral-medium-3-instruct, 30+ seconds with zero bytes) should not block the wizard for a full minute per failure mode. Add getValidationProbeCurlArgs() with --max-time 15 and use it in probeOpenAiLikeEndpoint and probeResponsesToolCalling.

The catalog-only validation false-positive (Bug 4 in the issue) is intentionally out of scope for this PR — that is a behavior change that should be discussed with maintainers first.
Adds 6 new unit tests in src/lib/validation.test.ts covering all three new helpers, with the literal NVCF error string reproduced from a live API run.

Closes NVIDIA#1601

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
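Fix 1 from the commit message above can be sketched as follows. This is a minimal sketch assuming the helper keys off the provider id alone; the real `shouldSkipResponsesProbe()` in src/lib/validation.ts may take more context, and `buildProbeList` is a hypothetical stand-in for the gating inside `probeOpenAiLikeEndpoint`.

```javascript
// Providers known not to expose /v1/responses at all (assumption:
// nvidia-prod is the only entry relevant to this PR).
const PROVIDERS_WITHOUT_RESPONSES_API = new Set(["nvidia-prod"]);

function shouldSkipResponsesProbe(provider) {
  return PROVIDERS_WITHOUT_RESPONSES_API.has(provider);
}

// Hypothetical probe-list construction gated on the helper:
function buildProbeList(provider) {
  const probes = [];
  if (!shouldSkipResponsesProbe(provider)) {
    probes.push("/v1/responses"); // NVIDIA Build always 404s here
  }
  probes.push("/v1/chat/completions");
  return probes;
}
```

For `nvidia-prod` this yields a probe list containing only `/v1/chat/completions`, so the noisy 404 never enters the failure messages.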
1. Reword nvcfFunctionNotFoundMessage to start with 'Model <id> not found'
so classifyValidationFailure() matches its model.+not found regex and
routes the user into the model-selection recovery path. Without this,
the recovery flow degrades to the generic unknown/selection branch
even though the message tells the user to pick a different model.
2. Preserve the raw probe response body alongside the summarized message
in the failures array, and have the NVCF detector check both message
and body. summarizeProbeError currently surfaces the NVCF detail
string through message, but if that ever changes the detector keeps
working.
Adds a regression test asserting that classifyValidationFailure() routes
the reframed NVCF message to {kind: 'model', retry: 'model'}.
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
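The rewording in point 1 above can be illustrated with a sketch. The real `classifyValidationFailure()` lives in src/lib/validation.ts and its exact regex is not shown in this thread; the pattern below is an assumed approximation, and the point is only that a message leading with "Model '<id>' not found" hits the model branch.

```javascript
// Assumed approximation of the classifier's model.+not found regex.
const MODEL_NOT_FOUND_RE = /model\s+'[^']+'\s+not\s+found/i;

function classifyValidationFailure(message) {
  if (MODEL_NOT_FOUND_RE.test(message)) {
    return { kind: "model", retry: "model" }; // model-selection recovery path
  }
  return { kind: "unknown", retry: "selection" }; // generic fallback
}

// Reworded NVCF message now leads with the classifier-recognized phrasing.
const reframed =
  "Model 'mistralai/mistral-large' not found: it is in the NVIDIA Build " +
  "catalog but is not deployed for your account.";
```

Without the leading "Model '<id>' not found", the same message would fall through to the generic branch, which is exactly the degradation the commit describes.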
Force-pushed: aff5bf8 to 8ace970
@cv Thanks for the thorough security review! Rebased onto current main — clean linear history, CI should pick up a fresh run. Cheers!
Thanks for the fix — the code changes look good and the review is approved. CI is failing on two pre-existing ESLint errors that your branch picked up from an older
Both have since been fixed on
fix(onboard): three NVIDIA Endpoints model probe bugs (#1601) (NVIDIA#1602)

## Summary

- Skip the OpenAI Responses probe entirely for `nvidia-prod` (NVIDIA Build does not expose `/v1/responses` for any model — every probe to that path returns `404 page not found`)
- Detect the NVCF `Function '<uuid>': Not found for account '<id>'` failure and surface an actionable message that points the user at picking a different model on https://build.nvidia.com
- Tighten the per-validation-probe `--max-time` from the default 60s to 15s so a hung NVIDIA backend (confirmed live on `mistralai/mistral-medium-3-instruct`) cannot block the wizard for a full minute per failure mode
- Closes NVIDIA#1601

## Problem

Three quirks of NVIDIA Build (https://integrate.api.nvidia.com/v1) combine to make custom-model selection in `nemoclaw onboard` look broken even when the model would otherwise work. The default Nemotron path is unaffected because it picks models that happen to mask all three behaviors.

The reporter (**@kaviin27** in the NVIDIA Developer Discord) saw this exact failure pasted in their bug report:

> ```
> Responses API with tool calling: HTTP 404: 404 page not found
> Chat Completions API: HTTP 404: Function 'ZZZZ': Not found for account 'XXXX'
> ```

I reproduced both halves of that against a live free-tier NVIDIA Build key and traced them to three independent code paths.

### 1. Responses probe always 404s on NVIDIA Build

[`bin/lib/onboard.js:probeOpenAiLikeEndpoint()`](bin/lib/onboard.js) probes `/v1/responses` first when `requireResponsesToolCalling` is true (which is gated on `provider === 'nvidia-prod'`). NVIDIA Build does not expose `/v1/responses` for **any** model — even nemotron-49b returns `404 page not found`. The probe loop silently falls through to chat completions on success, so working models still onboard, but the noisy 404 leaks into error messages whenever chat completions also fails. That's why @kaviin27's error string had both halves.
Fix: skip the Responses probe entirely for nvidia-prod via a new `shouldSkipResponsesProbe()` helper.

### 2. NVCF `Function not found for account` is opaque to the user

Many catalog models return:

```json
{"status":404,"title":"Not Found","detail":"Function 'a1b2c3': Not found for account 'xyz'"}
```

This means the model is in the public catalog but is not deployed for the user's account/org. The current code joins all probe failures into one raw string and surfaces the NVCF body verbatim, leaving the user with no idea what to do.

Fix: detect the pattern with `isNvcfFunctionNotFoundForAccount()` and replace the message with `nvcfFunctionNotFoundMessage(model)`, which says:

> Model '<id>' is in the NVIDIA Build catalog but is not deployed for your account. Pick a different model, or check the model card on https://build.nvidia.com to see if it requires org-level access.

### 3. Validation probes inherit a 60s curl timeout

`getCurlTimingArgs()` returns `--max-time 60`. That is fine for inference calls, but a misbehaving model (confirmed: `mistralai/mistral-medium-3-instruct` hangs 30+ seconds with zero bytes returned, 3 attempts in a row) should not block the wizard for a full minute per failure mode. The user-visible result is a wizard that just sits there for ages on a model the user can't fix.

Fix: add `getValidationProbeCurlArgs()` that returns `--max-time 15` and use it in `probeOpenAiLikeEndpoint` and `probeResponsesToolCalling`.
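Fixes 2 and 3 can be sketched together. Helper names match the PR; the regex, message wording, and curl flag values follow the description above, but the bodies are illustrative, not the merged implementation.

```javascript
// Fix 2: detect the NVCF account-not-deployed error quoted above.
const NVCF_NOT_FOUND_RE = /Function '[^']+': Not found for account '[^']+'/;

function isNvcfFunctionNotFoundForAccount(text) {
  return typeof text === "string" && NVCF_NOT_FOUND_RE.test(text);
}

// Reframe the opaque NVCF body into an actionable message.
function nvcfFunctionNotFoundMessage(model) {
  return (
    `Model '${model}' is in the NVIDIA Build catalog but is not deployed ` +
    `for your account. Pick a different model, or check the model card on ` +
    `https://build.nvidia.com to see if it requires org-level access.`
  );
}

// Fix 3: validation probes get a short curl timeout; inference calls keep
// the 60s default from getCurlTimingArgs().
function getValidationProbeCurlArgs() {
  return ["--connect-timeout", "10", "--max-time", "15"];
}
```

The restrictive regex (quoted function id and account id required) is what keeps the detector from misfiring on generic 404 bodies like `404 page not found`.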
## Test plan

- [x] **6 new unit tests** in `src/lib/validation.test.ts` covering `isNvcfFunctionNotFoundForAccount`, `nvcfFunctionNotFoundMessage`, and `shouldSkipResponsesProbe` — including the literal NVCF error string captured from a live API run
- [x] All 30 validation tests pass: `npx vitest run src/lib/validation.test.ts` → 30/30
- [x] Existing onboard test suites unaffected — verified `test/onboard-selection.test.js` and `test/onboard-readiness.test.js` pass
- [x] **Live reproduction** against `https://integrate.api.nvidia.com/v1`:
  - `nvidia/llama-3.3-nemotron-super-49b-v1` chat → 200 in 0.67s (control)
  - `mistralai/mistral-large` chat → `404 Function 'XXX': Not found for account 'YYY'` (NVCF account-not-deployed error)
  - `mistralai/mistral-medium-3-instruct` chat → 30+ second hang, no bytes
  - `/v1/responses` for any model → `404 page not found`
- [x] `npx prettier --check` clean
- [x] `npx eslint` clean
- [x] Signed commit + DCO sign-off

Closes NVIDIA#1601

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved detection of NVIDIA "function not found" failures and return of a clearer, model-specific message and guidance.
  * Validation now preserves raw probe response bodies for better failure reporting.
* **Performance**
  * Tighter per-probe timeouts and optional skipping of unnecessary Responses probes for specific providers to reduce noise and speed up validation.
* **Tests**
  * Added unit tests covering the new validation helpers and probe-skip behavior.

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>