
AI-22071 Add MLServer runtime #14

Merged
fyuan1316 merged 2 commits into master from add-mlserver-runtime on Aug 15, 2025

Conversation

@fyuan1316 (Contributor) commented Aug 15, 2025

Refactor extend runtimes

Add MLServer runtime

Summary by CodeRabbit

  • Documentation
    • Reorganized custom inference runtime guide with a new "Configuration Examples for Runtimes" section.
    • Added tabbed templates for MLServer and Xinference across GPU, NPU, and CPU, including environment variables, probes, resource settings, and supported model formats.
    • Updated publishing workflow: renamed sections, clarified runtime selection, added a step for setting environment variables.
    • Consolidated inline examples; example filename changed to your-runtime.yaml.
    • Expanded environment-variable guidance, including MODEL_FAMILY.

Add MLServer runtime

Signed-off-by: Yuan Fang <yuanfang@alauda.io>
@coderabbitai bot commented Aug 15, 2025

Walkthrough

Reorganized the custom inference runtime docs: removed inline Xinference-only examples, added a tabbed "Configuration Examples for Runtimes" (MLServer and Xinference for GPU/NPU/CPU), generalized publishing steps and headings, updated example filenames, and expanded environment-variable guidance (including MODEL_FAMILY).

Changes

Cohort / File(s) Summary
Docs: Custom Inference Runtime refactor
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
Replaced inline Xinference YAML with a tabbed “Configuration Examples for Runtimes” (MLServer; Xinference: GPU/NPU/CPU). Generalized publishing flow and step names, added Step for environment variables, renamed example file to your-runtime.yaml, documented MODEL_FAMILY, and adjusted wording to “custom runtime.”

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Docs as Documentation
  participant ControlPlane as Inference Service API
  participant RuntimePod as Custom Runtime Pod

  User->>Docs: Read publish flow & runtime templates
  Docs-->>User: Provide MLServer/Xinference YAML examples
  User->>ControlPlane: Submit InferenceService using chosen runtime YAML
  ControlPlane->>RuntimePod: Create Pod with env vars, probes, resources
  RuntimePod-->>ControlPlane: Startup probe/health checks pass
  ControlPlane-->>User: Inference service becomes available

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • typhoonzero
  • zhaomingkun1030

Poem

I thump my paw—tabs spread with care,
MLServer, Xinference, kernels bare.
GPUs hum, NPUs softly sing,
Env vars set — the startup spring.
Hop, deploy; the runtime’s there. 🥕


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5702454 and 274bb08.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx



@coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (5)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (5)

45-56: Good reorg of Step 1; minor copy edit.

The step reads clearly and properly points to the new examples. One tiny nit: “examples below” is correct now, thanks for updating the tip.

  • Minor copy edit suggestion in the step title later (“Set Environment Variables(if needed)” → “Set Environment Variables (if needed)”) — see separate comment.

71-75: Add a space before the parenthesis in the step title.

Spacing: “Set Environment Variables(if needed)” → “Set Environment Variables (if needed)”.

Apply:

-4.  **Set Environment Variables(if needed)**:
+4.  **Set Environment Variables (if needed)**:

227-233: Consider adding a securityContext for consistency.

The MLServer example hardens the container. For parity, add the same securityContext to Xinference examples unless your runtime requires elevated privileges.


279-279: Avoid installing packages at container startup.

pip install transformers~=4.49.0 in the startup script introduces a network dependency, longer cold starts, and potential reproducibility issues. Bake this into the image instead.
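A minimal sketch of baking the dependency into the image at build time instead (the base image name is a hypothetical placeholder; use whatever image the runtime is actually built from):

```dockerfile
# Hypothetical base image; substitute the Xinference image the runtime uses.
FROM xprobe/xinference:latest

# Install the Transformers dependency at build time so containers start
# without a network fetch and with a pinned, reproducible version.
RUN pip install --no-cache-dir "transformers~=4.49.0"
```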


356-366: CPU template looks good; minor label nit.

For CPU, consider omitting cpaas.io/cuda-version entirely rather than setting it to an empty string.

Apply:

-      cpaas.io/cuda-version: ""
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 1bce989 and 5702454.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (1 hunks)
🔇 Additional comments (4)
docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx (4)

52-53: LGTM: kubectl apply example is correct.

Example file naming now matches the generalized runtime guidance.


121-129: LGTM: startupProbe path is consistent with V2 ready endpoint.

The probe should work with MLServer’s V2 endpoints on port 8080.
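For reference, a probe of that shape might look like the following sketch — the path and port follow the V2 protocol convention named in the comment, while the timing values are purely illustrative:

```yaml
startupProbe:
  httpGet:
    path: /v2/health/ready   # V2 readiness endpoint served by MLServer
    port: 8080
  failureThreshold: 30       # illustrative: allow up to 30 * 10s for model load
  periodSeconds: 10
```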


137-142: Verify supported model formats vs. MLServer runtime.

Declaring both mlflow and transformers under one MLServer runtime may not reflect what the provided image actually supports out of the box. SKLearn via MLServer is fine; MLflow is typically supported; “transformers” may require a custom implementation.

If the image doesn’t include a Transformers runtime, consider removing it or adding the appropriate implementation.


447-456: Environment variable guidance is clear.

The MODEL_FAMILY explanation and example are helpful and necessary for Xinference.

Comment on lines +96 to +98
labels:
modelClass: mlserver_sklearn.SKLearnModel
containers:

🛠️ Refactor suggestion

Non-standard field under spec: labels → move modelClass to env var.

spec.labels is not a valid field for KServe ClusterServingRuntime spec and will be ignored by the API. If you intended to configure MLServer’s model implementation, use the environment variable MLSERVER_MODEL_IMPLEMENTATION.

Apply:

-    labels:
-      modelClass: mlserver_sklearn.SKLearnModel
+    # Configure MLServer model implementation via environment variables instead of spec.labels

Then, in the container env list, add:

         env:
+          - name: MLSERVER_MODEL_IMPLEMENTATION
+            value: mlserver_sklearn.SKLearnModel
           - name: MLSERVER_MODEL_URI
             value: /mnt/models
           - name: MLSERVER_MODEL_NAME
             value: '{{.Name}}'

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
around lines 96 to 98, the example places modelClass under spec.labels which is
not a valid KServe ClusterServingRuntime field; remove the spec.labels block and
instead set the MLServer model implementation by adding an environment variable
entry in the container spec named MLSERVER_MODEL_IMPLEMENTATION with the value
mlserver_sklearn.SKLearnModel; update the container's env list accordingly so
the runtime reads the model implementation from the env var.
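Putting that suggestion together, the container env list would read roughly as follows (a sketch; the variable names and template values are taken from the diff in this comment, not verified against the image):

```yaml
env:
  - name: MLSERVER_MODEL_IMPLEMENTATION  # replaces the invalid spec.labels entry
    value: mlserver_sklearn.SKLearnModel
  - name: MLSERVER_MODEL_URI
    value: /mnt/models
  - name: MLSERVER_MODEL_NAME
    value: '{{.Name}}'
```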

Comment on lines +318 to +327
# 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.

GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")

echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"

if [ ${GPU_COUNT} -lt 1 ]; then
echo "No GPUs found. Please check if the container have aquired any GPU device"
exit 1
fi

💡 Verification agent

🧩 Analysis chain

NPU variable naming and messages refer to GPU.

  • Use NPU_COUNT and NPU terminology.
  • Grammar fix: “have acquired”.
  • Verify whether --n-gpu is the correct flag for NPU parallelism in Xinference. If not, adjust accordingly.

Apply:

-            # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
-
-            GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
-
-            echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
-
-            if [ ${GPU_COUNT} -lt 1 ]; then
-                echo "No GPUs found. Please check if the container have aquired any GPU device"
+            # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
+            NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
+            echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
+            if [ "${NPU_COUNT}" -lt 1 ]; then
+                echo "No NPUs found. Please check if the container has acquired any NPU device"
                 exit 1
             fi
@@
-            xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} --n-gpu ${GPU_COUNT} -e http://127.0.0.1:8080 $@
+            xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"

Please confirm whether xinference launch supports a dedicated NPU parallelism flag (replace --n-gpu if needed).

Also applies to: 334-335


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx"
echo "File path: $FILE"
if [ -f "$FILE" ]; then
  echo "File exists: yes"
else
  echo "File exists: no"
  exit 0
fi

echo
echo "---- lines 300-360 of the file ----"
sed -n '300,360p' "$FILE" || true

echo
echo "---- Searching repo for related terms ----"
rg -n --hidden -S "torch_npu.npu.device_count|torch_npu|--n-gpu|--n-npu|--npu|\\bnpu\\b|NPU_COUNT|npu_count|MODEL_UID|MODEL_NAME|xinference launch|\\bxinference\\b" || true

echo
echo "---- Searching specifically for occurrences of '--n-gpu' ----"
rg -n --hidden -S -- '--n-gpu' || true

Length of output: 9765


🌐 Web query:

Does Xinference CLI support a dedicated NPU parallelism flag (e.g., --n-npu or --npu), or does it use --n-gpu for accelerator count? Where is this documented (xinference CLI launch flags)?

💡 Result:

As of the latest available documentation, Xinference CLI does not provide a dedicated NPU-specific parallelism flag such as --n-npu or --npu for specifying the number of NPUs to use. Instead, the typical accelerator count flag used in Xinference is --n-gpu, which is for GPU parallelism; there is no mention of a direct NPU count flag in the core CLI launch flags or model launch interfaces documented for Xinference[5][1][3].

Documentation details:

  • The official user and internal docs outline commands like xinference launch and related entry points[1]. They include flags for model management, GPUs (--n-gpu), and engine selection but do not specify a flag for NPUs specifically.
  • When launching models, you select the inference engine via --model-engine, e.g., vllm, SGLang, llama.cpp, etc., but dedicated NPU control is not outlined in available Xinference CLI flags[5].
  • Models can be launched with various engine-specific parameters, which might pass directly to the underlying inference engine, but for acceleration hardware, Xinference currently documents only GPU-related knobs[5].

Where is this documented?

Summary Table: Xinference CLI Hardware Parallelism Flags

Accelerator Xinference Flag Documented Source
GPU --n-gpu [5]
NPU No dedicated flag [1][5][3]

Additional Notes:
Other machine learning inference tools (e.g., TFLite) do support dedicated NPU/GPU selection flags (like --use_nnapi=true and --nnapi_accelerator_name=...)[2], but such options are not present in the Xinference CLI currently. If your use case requires NPU scheduling, it might depend on specific engine integration or external system configuration, and you would need to check for updates or consult engine-level documentation.

If Xinference adds explicit NPU flags in future releases, they would be listed in the CLI help output or documented in the official CLI flag documentation[1][5]. As of August 2025, this is not present.


Rename GPU vars/messages to NPU in the NPU runtime block; keep --n-gpu

Verified changes to make (very short): rename GPU_COUNT -> NPU_COUNT, update messaging to NPU, fix grammar, quote variables in the xinference command. Confirmed: Xinference does not provide a dedicated NPU flag — keep using --n-gpu and pass the NPU count.

Files / locations:

  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx — lines ~318-327: rename variable and messages.
  • docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx — lines ~334-335: use NPU_COUNT in xinference launch and quote args.

Suggested diff:

-            # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
-
-            GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
-
-            echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
-
-            if [ ${GPU_COUNT} -lt 1 ]; then
-                echo "No GPUs found. Please check if the container have aquired any GPU device"
+            # 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
+            NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
+            echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
+            if [ "${NPU_COUNT}" -lt 1 ]; then
+                echo "No NPUs found. Please check if the container has acquired any NPU device"
                 exit 1
             fi
@@
-            xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} --n-gpu ${GPU_COUNT} -e http://127.0.0.1:8080 $@
+            xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-# 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
-GPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
-echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
-if [ ${GPU_COUNT} -lt 1 ]; then
-echo "No GPUs found. Please check if the container have aquired any GPU device"
-exit 1
-fi
+# 1. Check the number of available NPUs to decide whether to enable multi-NPU tensor parallelism.
+NPU_COUNT=$(python3 -c "import torch_npu; print(torch_npu.npu.device_count())")
+echo "Starting serving model uid: ${MODEL_UID}, num NPUs: ${NPU_COUNT}"
+if [ "${NPU_COUNT}" -lt 1 ]; then
+echo "No NPUs found. Please check if the container has acquired any NPU device"
+exit 1
+fi
+xinference launch --model_path "${MODEL_PATH}" --model-engine "${MODEL_ENGINE}" -u "${MODEL_UID}" -n "${MODEL_FAMILY}" --n-gpu "${NPU_COUNT}" -e http://127.0.0.1:8080 "$@"
🤖 Prompt for AI Agents
In docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
around lines 318-327 and 334-335, rename the GPU-specific variable and messages
to NPU (change GPU_COUNT -> NPU_COUNT), update the echo messages to reference
NPU and fix grammar, and in the xinference launch lines replace the GPU variable
with NPU_COUNT while keeping the --n-gpu flag and quoting the variable (e.g.,
"--n-gpu" "$NPU_COUNT") so the count is passed correctly and safely.

@fyuan1316 fyuan1316 merged commit cffa695 into master Aug 15, 2025
2 checks passed
fyuan1316 added a commit that referenced this pull request Aug 18, 2025
fyuan1316 added a commit that referenced this pull request Aug 18, 2025
@coderabbitai coderabbitai bot mentioned this pull request Jan 6, 2026
@coderabbitai coderabbitai bot mentioned this pull request Jan 28, 2026