
Conversation

@YuhanLiu11 (Collaborator):

This PR adds a tutorial for Gateway Inference Extension support for Production Stack.

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s when running git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand. The PR title should be prefixed appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.
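
For example, the trailer can be added at commit time, or retroactively to the most recent commit if it was forgotten (the commit message below is just an example):

```bash
# Create a commit with a Signed-off-by: trailer (example message)
git commit -s -m "[Doc] Add Gateway Inference Extension tutorial"

# Add the trailer to the last commit without changing its message
git commit --amend -s --no-edit
```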

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @YuhanLiu11, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the documentation by adding a detailed tutorial for integrating and utilizing the Gateway Inference Extension within a production-grade Kubernetes setup, specifically leveraging vLLM for model serving. The changes aim to provide clear, step-by-step instructions for users to deploy, configure, and manage their inference workloads via the Gateway API, alongside introducing utility scripts for installation and cleanup.

Highlights

  • New Tutorial Document: I've added a comprehensive new tutorial document, tutorials/21-gateway-inference-extension.md, which guides users through setting up and using the Gateway Inference Extension with vLLM in a Kubernetes production environment. This covers everything from prerequisites and environment setup to deploying models, configuring routing, testing, monitoring, and uninstallation.
  • Updated Deployment References: I've updated the src/gateway_inference_extension/README.md and src/gateway_inference_extension/install.sh scripts to reflect a change in the VLLM deployment configuration file name, now referencing configs/vllm/gpu-deployment.yaml instead of vllm-runtime.yaml.
  • New Cleanup Script: I've introduced a new shell script, src/gateway_inference_extension/delete.sh, to provide a convenient way to uninstall all Kubernetes resources deployed as part of the Gateway Inference Extension tutorial, ensuring a clean teardown.
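
For readers following along, the two helper scripts can presumably be run directly from the src/gateway_inference_extension directory; the invocation below is a sketch, so check that directory's README for the authoritative steps:

```bash
cd src/gateway_inference_extension

# Install the gateway, inference extension resources, and the vLLM deployment
bash install.sh

# Tear everything down again once finished
bash delete.sh
```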
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.
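
As a rough, hypothetical sketch of that setup (the file name .gemini/styleguide.md and the guide contents are assumptions here; the linked documentation is authoritative):

```bash
# Hypothetical repository-level configuration folder for Gemini Code Assist
mkdir -p .gemini

# Example style guide the reviewer could be asked to follow (content is illustrative)
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Follow PEP-8 for Python code.
- Prefer descriptive names over abbreviations.
- Flag missing docstrings on public functions.
EOF
```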

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment:

Code Review

This PR adds a tutorial for Gateway Inference Extension support for Production Stack. The tutorial covers setting up the environment, deploying vLLM models, configuring inference resources and gateway routing, testing the setup, monitoring, troubleshooting, and uninstalling the resources. The changes include updates to the README and install/delete scripts to use gpu-deployment.yaml instead of vllm-runtime.yaml, and adding a PORT variable in the README. The tutorial provides a comprehensive guide for users to get started with the Gateway Inference Extension.
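
As a rough sketch of the kind of end-to-end test the tutorial describes (the gateway name inference-gateway, the port, and the model name below are assumptions; substitute the values used in the tutorial):

```bash
# Look up the external address of the inference gateway (gateway name is an assumption)
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

# Send a completion request through the gateway (model name is a placeholder)
curl -i "http://${IP}:${PORT}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "prompt": "Hello, world", "max_tokens": 16}'
```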


```diff
 # Apply VLLM deployment using the VLLMRuntime CRD
-kubectl apply -f configs/vllm/vllm-runtime.yaml
+kubectl apply -f configs/vllm/gpu-deployment.yaml
```
@gemini-code-assist (bot) commented (medium):

Consider adding a comment explaining why gpu-deployment.yaml is preferred over vllm-runtime.yaml. This will help users understand the rationale behind the change.

Suggested change:

```diff
-kubectl apply -f configs/vllm/gpu-deployment.yaml
+kubectl apply -f configs/vllm/gpu-deployment.yaml # Use gpu-deployment for better GPU utilization
```


> ### 2.1 Understanding vLLM Runtime
>
> The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config.
@gemini-code-assist (bot) commented (medium):

It would be helpful to provide a brief explanation of what the configs/vllm/gpu-deployment.yaml file contains and what aspects users might want to customize.

Suggested change:

```diff
-The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config.
+The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config. This file defines the deployment configuration for vLLM, including the model to use, resource requests, and other settings.
```
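
As a quick sanity check after applying that config, the deployment can be inspected with standard kubectl commands (the deployment name below is a placeholder; use whatever name configs/vllm/gpu-deployment.yaml actually defines):

```bash
# Apply the example vLLM GPU deployment from the tutorial
kubectl apply -f configs/vllm/gpu-deployment.yaml

# List deployments and pods to find the vLLM workload
kubectl get deployments
kubectl get pods -o wide

# Follow the vLLM server logs (replace <vllm-deployment> with the real deployment name)
kubectl logs deploy/<vllm-deployment> -f
```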

> ### 3.2 Inference Pool Configuration
>
> For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example.
@gemini-code-assist (bot) commented (medium):

Similar to the vLLM Runtime, briefly describe what the configs/inferencepool-resources.yaml file contains and what aspects users might want to customize.

Suggested change:

```diff
-For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example.
+For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example. This file configures the InferencePool, which manages the routing and load balancing of requests to multiple model instances.
```
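
Once the pool resources are applied, the custom resources can be inspected like any other Kubernetes object; the resource kinds below follow the config file names and are a sketch rather than the tutorial's exact output:

```bash
# Apply the inference model and pool resources from the tutorial
kubectl apply -f configs/inferencemodel.yaml
kubectl apply -f configs/inferencepool-resources.yaml

# Inspect the custom resources created by the Gateway API Inference Extension
kubectl get inferencepools
kubectl get inferencemodels

# Show details for a specific pool (replace <pool-name> with the real name)
kubectl describe inferencepool <pool-name>
```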

Comment on lines 237 to 255

To uninstall all the resources installed on the cluster, run the following:

```bash
# Delete the inference extension
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml

# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml
kubectl delete -f configs/inferencepool-resources.yaml

# Delete the VLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml

kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
kubectl delete ns kgateway-system
```
@gemini-code-assist (bot) commented (medium):

The uninstall instructions include deleting resources by applying the same manifests used for installation. It's generally safer and more explicit to use kubectl delete with the --ignore-not-found flag to ensure the script doesn't fail if some resources were not created. Also, consider adding a comment about the order of deletion to avoid dependency issues.

Suggested change (replacing the block above):

```bash
# Delete the inference extension
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml --ignore-not-found=true
# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml --ignore-not-found=true
kubectl delete -f configs/inferencepool-resources.yaml --ignore-not-found=true
# Delete the VLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml --ignore-not-found=true
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml --ignore-not-found=true
# Delete helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
# Delete the namespace last to ensure all resources are removed
kubectl delete ns kgateway-system --ignore-not-found=true
```
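
After the teardown, a quick verification that nothing was left behind might look like this (a sketch; if the CRDs were removed along with the manifests, the custom-resource queries will report unknown resource types, which is expected):

```bash
# The namespace and its workloads should be gone (NotFound errors are expected)
kubectl get ns kgateway-system
kubectl get pods -n kgateway-system

# No inference extension custom resources should remain
kubectl get inferencepools,inferencemodels --all-namespaces

# No leftover Helm releases
helm list -n kgateway-system
```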

YuhanLiu11 and others added 4 commits July 7, 2025 23:23
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
@Shaoting-Feng (Collaborator) left a comment:

LGTM

@Shaoting-Feng merged commit 633b1aa into vllm-project:main on Jul 8, 2025
7 checks passed
@YuhanLiu11 deleted the localdev/gie_doc branch on July 8, 2025 23:20
Senne-Mennes pushed a commit to Senne-Mennes/production-stack that referenced this pull request Oct 22, 2025
…roject#570)

* Adding tutorial for GIE

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* format checking

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* fixing shell format checker

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* fixing comments from gemini

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

---------

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Shaoting <shaotingf@uchicago.edu>
Signed-off-by: senne.mennes@capgemini.com <senne.mennes@capgemini.com>