
Conversation

@YuhanLiu11 (Collaborator):

This PR adds a tutorial for Gateway Inference Extension support for Production Stack.

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s when running git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand. The PR title should be prefixed appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.
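
For example, the trailer can be added at commit time, or retroactively to the most recent commit if it was forgotten (the commit message below is just an example):

```bash
# Create a commit with a Signed-off-by: trailer (example message)
git commit -s -m "[Doc] Add Gateway Inference Extension tutorial"

# Add the trailer to the last commit without changing its message
git commit --amend -s --no-edit
```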

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
@gemini-code-assist (bot) left a comment:

Summary of Changes

Hello @YuhanLiu11, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the documentation by adding a detailed tutorial for integrating and utilizing the Gateway Inference Extension within a production-grade Kubernetes setup, specifically leveraging vLLM for model serving. The changes aim to provide clear, step-by-step instructions for users to deploy, configure, and manage their inference workloads via the Gateway API, alongside introducing utility scripts for installation and cleanup.

Highlights

  • New Tutorial Document: I've added a comprehensive new tutorial document, tutorials/21-gateway-inference-extension.md, which guides users through setting up and using the Gateway Inference Extension with vLLM in a Kubernetes production environment. This covers everything from prerequisites and environment setup to deploying models, configuring routing, testing, monitoring, and uninstallation.
  • Updated Deployment References: I've updated the src/gateway_inference_extension/README.md and src/gateway_inference_extension/install.sh scripts to reflect a change in the VLLM deployment configuration file name, now referencing configs/vllm/gpu-deployment.yaml instead of vllm-runtime.yaml.
  • New Cleanup Script: I've introduced a new shell script, src/gateway_inference_extension/delete.sh, to provide a convenient way to uninstall all Kubernetes resources deployed as part of the Gateway Inference Extension tutorial, ensuring a clean teardown.
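
For readers following along, the two helper scripts can presumably be run directly from the src/gateway_inference_extension directory; the invocation below is a sketch, so check that directory's README for the authoritative steps:

```bash
cd src/gateway_inference_extension

# Install the gateway, inference extension resources, and the vLLM deployment
bash install.sh

# Tear everything down again once finished
bash delete.sh
```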
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.
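
As a rough, hypothetical sketch of that setup (the file name .gemini/styleguide.md and the guide contents are assumptions here; the linked documentation is authoritative):

```bash
# Hypothetical repository-level configuration folder for Gemini Code Assist
mkdir -p .gemini

# Example style guide the reviewer could be asked to follow (content is illustrative)
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Follow PEP-8 for Python code.
- Prefer descriptive names over abbreviations.
- Flag missing docstrings on public functions.
EOF
```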

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment:

Code Review

This PR adds a tutorial for Gateway Inference Extension support for Production Stack. The tutorial covers setting up the environment, deploying vLLM models, configuring inference resources and gateway routing, testing the setup, monitoring, troubleshooting, and uninstalling the resources. The changes include updates to the README and install/delete scripts to use gpu-deployment.yaml instead of vllm-runtime.yaml, and adding a PORT variable in the README. The tutorial provides a comprehensive guide for users to get started with the Gateway Inference Extension.
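
As a rough sketch of the kind of end-to-end test the tutorial describes (the gateway name inference-gateway, the port, and the model name below are assumptions; substitute the values used in the tutorial):

```bash
# Look up the external address of the inference gateway (gateway name is an assumption)
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

# Send a completion request through the gateway (model name is a placeholder)
curl -i "http://${IP}:${PORT}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "prompt": "Hello, world", "max_tokens": 16}'
```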


```diff
 # Apply VLLM deployment using the VLLMRuntime CRD
-kubectl apply -f configs/vllm/vllm-runtime.yaml
+kubectl apply -f configs/vllm/gpu-deployment.yaml
```
@gemini-code-assist (bot) commented (medium):

Consider adding a comment explaining why gpu-deployment.yaml is preferred over vllm-runtime.yaml. This will help users understand the rationale behind the change.

Suggested change:

```diff
-kubectl apply -f configs/vllm/gpu-deployment.yaml
+kubectl apply -f configs/vllm/gpu-deployment.yaml # Use gpu-deployment for better GPU utilization
```


> ### 2.1 Understanding vLLM Runtime
>
> The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config.
@gemini-code-assist (bot) commented (medium):

It would be helpful to provide a brief explanation of what the configs/vllm/gpu-deployment.yaml file contains and what aspects users might want to customize.

Suggested change:

```diff
-The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config.
+The vLLM Runtime is a custom resource that manages model deployments. Please check ``configs/vllm/gpu-deployment.yaml`` for an example config. This file defines the deployment configuration for vLLM, including the model to use, resource requests, and other settings.
```
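
As a quick sanity check after applying that config, the deployment can be inspected with standard kubectl commands (the deployment name below is a placeholder; use whatever name configs/vllm/gpu-deployment.yaml actually defines):

```bash
# Apply the example vLLM GPU deployment from the tutorial
kubectl apply -f configs/vllm/gpu-deployment.yaml

# List deployments and pods to find the vLLM workload
kubectl get deployments
kubectl get pods -o wide

# Follow the vLLM server logs (replace <vllm-deployment> with the real deployment name)
kubectl logs deploy/<vllm-deployment> -f
```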

> ### 3.2 Inference Pool Configuration
>
> For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example.
@gemini-code-assist (bot) commented (medium):

Similar to the vLLM Runtime, briefly describe what the configs/inferencepool-resources.yaml file contains and what aspects users might want to customize.

Suggested change:

```diff
-For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example.
+For routing to multiple model instances, check ``configs/inferencepool-resources.yaml`` for example. This file configures the InferencePool, which manages the routing and load balancing of requests to multiple model instances.
```
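
Once the pool resources are applied, the custom resources can be inspected like any other Kubernetes object; the resource kinds below follow the config file names and are a sketch rather than the tutorial's exact output:

```bash
# Apply the inference model and pool resources from the tutorial
kubectl apply -f configs/inferencemodel.yaml
kubectl apply -f configs/inferencepool-resources.yaml

# Inspect the custom resources created by the Gateway API Inference Extension
kubectl get inferencepools
kubectl get inferencemodels

# Show details for a specific pool (replace <pool-name> with the real name)
kubectl describe inferencepool <pool-name>
```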

Comment on lines 237 to 255

To uninstall all the resources installed on the cluster, run the following:

```bash
# Delete the inference extension
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml

# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml
kubectl delete -f configs/inferencepool-resources.yaml

# Delete the VLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml

kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml

helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
kubectl delete ns kgateway-system
```
@gemini-code-assist (bot) commented (medium):

The uninstall instructions include deleting resources by applying the same manifests used for installation. It's generally safer and more explicit to use kubectl delete with the --ignore-not-found flag to ensure the script doesn't fail if some resources were not created. Also, consider adding a comment about the order of deletion to avoid dependency issues.

Suggested change (replacing the block above):

```bash
# Delete the inference extension
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml --ignore-not-found=true
# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml --ignore-not-found=true
kubectl delete -f configs/inferencepool-resources.yaml --ignore-not-found=true
# Delete the VLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml --ignore-not-found=true
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml --ignore-not-found=true
# Delete helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
# Delete the namespace last to ensure all resources are removed
kubectl delete ns kgateway-system --ignore-not-found=true
```
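
After the teardown, a quick verification that nothing was left behind might look like this (a sketch; if the CRDs were removed along with the manifests, the custom-resource queries will report unknown resource types, which is expected):

```bash
# The namespace and its workloads should be gone (NotFound errors are expected)
kubectl get ns kgateway-system
kubectl get pods -n kgateway-system

# No inference extension custom resources should remain
kubectl get inferencepools,inferencemodels --all-namespaces

# No leftover Helm releases
helm list -n kgateway-system
```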

YuhanLiu11 and others added 4 commits July 7, 2025 23:23
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
@Shaoting-Feng (Collaborator) left a comment:

LGTM

@Shaoting-Feng merged commit 633b1aa into vllm-project:main on Jul 8, 2025
7 checks passed
@YuhanLiu11 deleted the localdev/gie_doc branch on July 8, 2025 23:20
Senne-Mennes pushed a commit to Senne-Mennes/production-stack that referenced this pull request Oct 22, 2025
…roject#570)

* Adding tutorial for GIE

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* format checking

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* fixing shell format checker

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

* fixing comments from gemini

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>

---------

Signed-off-by: YuhanLiu11 <yliu738@wisc.edu>
Co-authored-by: Shaoting <shaotingf@uchicago.edu>
Signed-off-by: senne.mennes@capgemini.com <senne.mennes@capgemini.com>