Conversation
Add MLServer runtime Signed-off-by: Yuan Fang <yuanfang@alauda.io>
**Walkthrough**

Reorganized the custom inference runtime docs: removed inline Xinference-only examples, added a tabbed "Configuration Examples for Runtimes" section (MLServer and Xinference for GPU/NPU/CPU), generalized publishing steps and headings, updated example filenames, and expanded environment-variable guidance (including `MODEL_FAMILY`).
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant User
    participant Docs as Documentation
    participant ControlPlane as Inference Service API
    participant RuntimePod as Custom Runtime Pod
    User->>Docs: Read publish flow & runtime templates
    Docs-->>User: Provide MLServer/Xinference YAML examples
    User->>ControlPlane: Submit InferenceService using chosen runtime YAML
    ControlPlane->>RuntimePod: Create Pod with env vars, probes, resources
    RuntimePod-->>ControlPlane: Startup probe/health checks pass
    ControlPlane-->>User: Inference service becomes available
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 8
🧹 Nitpick comments (5)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (5)
45-56: Good reorg of Step 1; minor copy edit. The step reads clearly and properly points to the new examples. One tiny nit: "examples below" is correct now, thanks for updating the tip.
- Minor copy edit suggestion in the step title later (“Set Environment Variables(if needed)” → “Set Environment Variables (if needed)”) — see separate comment.
71-75: Add a space before the parenthesis in the step title. Spacing: "Set Environment Variables(if needed)" → "Set Environment Variables (if needed)".

Apply:

```diff
-4. **Set Environment Variables(if needed)**:
+4. **Set Environment Variables (if needed)**:
```
227-233: Consider adding a `securityContext` for consistency. The MLServer example hardens the container. For parity, add the same `securityContext` to the Xinference examples unless your runtime requires elevated privileges.
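A minimal sketch of what such a hardening block could look like (the field values here are assumptions, not copied from the PR's MLServer example):

```yaml
# Illustrative only — mirror whatever the MLServer example in this PR actually sets.
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
```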
279-279: Avoid installing packages at container startup. `pip install transformers~=4.49.0` in the startup script introduces a network dependency, longer cold starts, and potential reproducibility issues. Bake this into the image instead.
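One way to bake the dependency in at build time, sketched against a hypothetical base image (the image name and tag are assumptions, not taken from the PR):

```dockerfile
# Hypothetical base image — substitute the runtime image this PR actually uses.
FROM xprobe/xinference:latest

# Pin and install at build time so container startup stays offline and reproducible.
RUN pip install --no-cache-dir "transformers~=4.49.0"
```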
356-366: CPU template looks good; minor label nit. For CPU, consider omitting `cpaas.io/cuda-version` entirely rather than setting it to an empty string.

Apply:

```diff
-    cpaas.io/cuda-version: ""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx` (1 hunks)
🔇 Additional comments (4)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (4)
52-53: LGTM: kubectl apply example is correct. Example file naming now matches the generalized runtime guidance.
121-129: LGTM: startupProbe path is consistent with the V2 ready endpoint. The probe should work with MLServer's V2 endpoints on port 8080.
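For context, a probe against the V2 ready endpoint typically has this shape (the threshold and period values below are assumptions, not the PR's actual numbers):

```yaml
startupProbe:
  httpGet:
    path: /v2/health/ready   # KServe V2 / MLServer readiness endpoint
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```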
137-142: Verify supported model formats vs. MLServer runtime. Declaring both `mlflow` and `transformers` under one MLServer runtime may not reflect what the provided image actually supports out of the box. SKLearn via MLServer is fine; MLflow is typically supported; `transformers` may require a custom implementation. If the image doesn't include a Transformers runtime, consider removing it or adding the appropriate implementation.
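The field in question looks roughly like this in a ClusterServingRuntime spec (the version strings are placeholders, not values from the PR):

```yaml
# Illustrative shape only — trim the entries to what the image actually ships.
supportedModelFormats:
  - name: sklearn
    version: "1"
    autoSelect: true
  - name: mlflow
    version: "2"
  # - name: transformers   # only if the image bundles a Transformers runtime
```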
447-456: Environment variable guidance is clear. The `MODEL_FAMILY` explanation and example are helpful and necessary for Xinference.
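As a reminder of the shape such guidance describes, an env entry might look like this (the value `qwen2-instruct` is just an example family name, not taken from the PR):

```yaml
env:
  - name: MODEL_FAMILY
    value: qwen2-instruct   # must match a model family name Xinference recognizes
```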
```yaml
labels:
  modelClass: mlserver_sklearn.SKLearnModel
containers:
```
🛠️ Refactor suggestion
Non-standard field under spec: `labels` → move `modelClass` to an env var.
`spec.labels` is not a valid field for the KServe ClusterServingRuntime spec and will be ignored by the API. If you intended to configure MLServer's model implementation, use the environment variable `MLSERVER_MODEL_IMPLEMENTATION`.
Apply:

```diff
-  labels:
-    modelClass: mlserver_sklearn.SKLearnModel
+  # Configure MLServer model implementation via environment variables instead of spec.labels
```

Then, in the container env list, add:

```diff
       env:
+        - name: MLSERVER_MODEL_IMPLEMENTATION
+          value: mlserver_sklearn.SKLearnModel
         - name: MLSERVER_MODEL_URI
           value: /mnt/models
         - name: MLSERVER_MODEL_NAME
           value: '{{.Name}}'
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents

```
In docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
around lines 96 to 98, the example places modelClass under spec.labels, which is
not a valid KServe ClusterServingRuntime field; remove the spec.labels block and
instead set the MLServer model implementation by adding an environment variable
entry in the container spec named MLSERVER_MODEL_IMPLEMENTATION with the value
mlserver_sklearn.SKLearnModel; update the container's env list accordingly so
the runtime reads the model implementation from the env var.
```
```shell
# 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.

GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")

echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"

if [ ${GPU_COUNT} -lt 1 ]; then
  echo "No GPUs found. Please check if the container have aquired any GPU device"
  exit 1
fi
```
💡 Verification agent
🧩 Analysis chain
NPU variable naming and messages refer to GPU.
- Use `NPU_COUNT` and NPU terminology.
- Grammar fix: "have aquired" → "has acquired".
- Verify whether `--n-gpu` is the correct flag for NPU parallelism in Xinference. If not, adjust accordingly.
Apply:

```diff
- # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
-
- GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
-
- echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
-
- if [ ${GPU_COUNT} -lt 1 ]; then
-   echo "No GPUs found. Please check if the container have aquired any GPU device"
+ # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
+ NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
+ echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
+ if [ "${NPU_COUNT}" -lt 1 ]; then
+   echo "No NPUs found. Please check if the container has acquired any NPU device"
    exit 1
  fi
@@
- xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} --n-gpu ${GPU_COUNT} -e http://127.0.0.1:8080 $@
+ xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"
```

Please confirm whether `xinference launch` supports a dedicated NPU parallelism flag (replace `--n-gpu` if needed).
Also applies to: 334-335
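The quoting point can be seen in a tiny illustrative sketch (the values here are hypothetical, not from the PR):

```shell
# Hypothetical failure mode: the Python probe prints nothing, so the variable is empty.
GPU_COUNT=""

# Unquoted, the test expands to `[ -lt 1 ]`, which errors out instead of failing cleanly.
if ! [ $GPU_COUNT -lt 1 ] 2>/dev/null; then
  echo "unquoted check errored out"
fi

# Quoted with a default, the check degrades to 0 and takes the error branch as intended.
if [ "${GPU_COUNT:-0}" -lt 1 ]; then
  echo "no devices found"
fi
```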
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
FILE="docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx"
echo "File path: $FILE"
if [ -f "$FILE" ]; then
  echo "File exists: yes"
else
  echo "File exists: no"
  exit 0
fi
echo
echo "---- lines 300-360 of the file ----"
sed -n '300,360p' "$FILE" || true
echo
echo "---- Searching repo for related terms ----"
rg -n --hidden -S "torch_npu.npu.device_count|torch_npu|--n-gpu|--n-npu|--npu|\\bnpu\\b|NPU_COUNT|npu_count|MODEL_UID|MODEL_NAME|xinference launch|\\bxinference\\b" || true
echo
echo "---- Searching specifically for occurrences of '--n-gpu' ----"
rg -n --hidden -S -- '--n-gpu' || true
```

Length of output: 9765
🌐 Web query:

> Does Xinference CLI support a dedicated NPU parallelism flag (e.g., `--n-npu` or `--npu`), or does it use `--n-gpu` for accelerator count? Where is this documented (xinference CLI launch flags)?
💡 Result:
As of the latest available documentation, Xinference CLI does not provide a dedicated NPU-specific parallelism flag such as --n-npu or --npu for specifying the number of NPUs to use. Instead, the typical accelerator count flag used in Xinference is --n-gpu, which is for GPU parallelism; there is no mention of a direct NPU count flag in the core CLI launch flags or model launch interfaces documented for Xinference[5][1][3].
Documentation details:
- The official user and internal docs outline commands like `xinference launch` and related entry points[1]. They include flags for model management, GPUs (`--n-gpu`), and engine selection, but do not specify a flag for NPUs specifically.
- When launching models, you select the inference engine via `--model-engine` (e.g., vllm, SGLang, llama.cpp), but dedicated NPU control is not outlined in the available Xinference CLI flags[5].
- Models can be launched with various engine-specific parameters, which might pass directly to the underlying inference engine, but for acceleration hardware, Xinference currently documents only GPU-related knobs[5].

Where is this documented?
- See: Xinference Documentation – Using Xinference (Latest)[5].
- For internals and flags: Xinference Internals[1].
- Custom model launching and engine arguments are described, but again, only general engine flags and possible GPU options (`--gpu_memory_utilization`, etc.), not NPU-specific flags[3][5].
Summary Table: Xinference CLI Hardware Parallelism Flags

| Accelerator | Xinference Flag | Documented Source |
|---|---|---|
| GPU | `--n-gpu` | [5] |
| NPU | No dedicated flag | [1][5][3] |
Additional Notes:
Other machine learning inference tools (e.g., TFLite) do support dedicated NPU/GPU selection flags (like --use_nnapi=true and --nnapi_accelerator_name=...)[2], but such options are not present in the Xinference CLI currently. If your use case requires NPU scheduling, it might depend on specific engine integration or external system configuration, and you would need to check for updates or consult engine-level documentation.
If Xinference adds explicit NPU flags in future releases, they would be listed in the CLI help output or documented in the official CLI flag documentation[1][5]. As of August 2025, this is not present.
Citations:
- 1: https://inference.readthedocs.io/en/v1.4.0/development/xinference_internals.html
- 2: https://community.nxp.com/t5/i-MX-Processors/How-to-tell-when-inference-is-ran-by-GPU-or-NPU-on-iMX8MPlus/m-p/1278808/?profile.language=en
- 3: https://inference.readthedocs.io/en/v0.14.2/models/custom.html
- 4: https://huggingface.co/datasets/fdaudens/hf-blog-posts-split/viewer
- 5: https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html
- 6: https://lists.opensuse.org/archives/list/factory@lists.opensuse.org/latest?count=200&page=1
- 7: https://peter.sh/experiments/chromium-command-line-switches/
- 8: https://huggingface.co/datasets/fdaudens/hf-blog-posts/viewer/default/train
- 9: https://grafana.com/docs/alloy/latest/reference/cli/run/
- 10: https://news.ycombinator.com/item?id=40417828
Rename GPU vars/messages to NPU in the NPU runtime block; keep `--n-gpu`.
Verified changes to make (very short): rename `GPU_COUNT` -> `NPU_COUNT`, update messaging to NPU, fix grammar, and quote variables in the xinference command. Confirmed: Xinference does not provide a dedicated NPU flag; keep using `--n-gpu` and pass the NPU count.
Files / locations:
- docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx — lines ~318-327: rename variable and messages.
- docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx — lines ~334-335: use NPU_COUNT in xinference launch and quote args.
Suggested diff:

```diff
- # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
-
- GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
-
- echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
-
- if [ ${GPU_COUNT} -lt 1 ]; then
-   echo "No GPUs found. Please check if the container have aquired any GPU device"
+ # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
+ NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
+ echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
+ if [ "${NPU_COUNT}" -lt 1 ]; then
+   echo "No NPUs found. Please check if the container has acquired any NPU device"
    exit 1
  fi
@@
- xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} --n-gpu ${GPU_COUNT} -e http://127.0.0.1:8080 $@
+ xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```shell
# 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
if [ "${NPU_COUNT}" -lt 1 ]; then
  echo "No NPUs found. Please check if the container has acquired any NPU device"
  exit 1
fi

xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"
```
🤖 Prompt for AI Agents

```
In docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
around lines 318-327 and 334-335, rename the GPU-specific variable and messages
to NPU (change GPU_COUNT -> NPU_COUNT), update the echo messages to reference
NPU and fix grammar, and in the xinference launch lines replace the GPU variable
with NPU_COUNT while keeping the --n-gpu flag and quoting the variable (e.g.,
"--n-gpu" "$NPU_COUNT") so the count is passed correctly and safely.
```
This reverts commit cffa695.
Refactor extend runtimes
Add MLServer runtime