This repository was archived by the owner on May 6, 2026. It is now read-only.

Document GKE TPU Performance #133

Merged: aojea merged 2 commits into google:main from aojea:tpuperf on Jun 23, 2025

Conversation

@aojea aojea commented Jun 23, 2025

It turns out that some processes, such as Cilium, pin their programs to the bpf filesystem. We need to delete those pins before we can remove the bpf programs; otherwise we cannot detach them, because they are still referenced.
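
For readers unfamiliar with pinning, here is a minimal shell sketch of the idea, not the code in this PR. It assumes the pins live under a bpffs mount (commonly `/sys/fs/bpf`) and can be selected by a file-name pattern; `remove_bpf_pins` and the pattern are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch only: unpin BPF objects so the kernel can release the programs.
# Assumptions (not taken from this PR): the pins live under a bpffs mount
# and can be selected by file name. remove_bpf_pins is a hypothetical helper.

remove_bpf_pins() {
    pin_root="${1:-/sys/fs/bpf}"   # bpffs mount point (assumed default)
    pin_pattern="${2:-}"           # shell glob matching pin file names
    if [ -z "$pin_pattern" ]; then
        echo "usage: remove_bpf_pins ROOT PATTERN" >&2
        return 1
    fi
    # Pinned objects show up as regular files in bpffs. Removing the file
    # drops the reference held by the pin; the path of each removed pin
    # is printed.
    find "$pin_root" -type f -name "$pin_pattern" -print -exec rm -f {} +
}
```

Running `remove_bpf_pins /sys/fs/bpf 'example_*'` as root would delete every pin whose file name starts with `example_`. Inspecting the remaining programs first (for instance with `bpftool prog show`) is a sensible precaution, since deleting a pin that another component still relies on can break it.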

Add documentation on how to maximize TCP throughput on TPU v6 machines by using two virtual interfaces that map to the two physical interfaces of the machine. @samos123, you'll be interested in this.

Comment thread site/content/docs/user/gke-tpu-performance.md Outdated

Another important factor is DraNet's ability to pass interface configuration options that let you tune the interfaces for maximum performance, for example [Big TCP](https://lwn.net/Articles/884104/).

In addition, if you have gVNIC enabled, you can use private ethtool flags that improve TCP performance, such as [enable-max-rx-buffer-size](enable-max-rx-buffer-size).
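
As a rough illustration of what this tuning looks like when applied by hand, the sketch below prints (without executing) the equivalent `ip` and `ethtool` commands for one interface. The interface name, the default size of 185000, and the `tune_cmds` helper are assumptions made for the example; they are not values taken from this document, and the private flag is only available if the driver exposes it.

```shell
#!/bin/sh
# Sketch only: print the manual tuning commands for one interface (dry run).
# Assumptions: the interface name and the size value are illustrative;
# tune_cmds is a hypothetical helper, not part of DraNet.

tune_cmds() {
    dev="$1"               # interface to tune, e.g. eth1
    size="${2:-185000}"    # GSO/GRO max size in bytes (illustrative value)
    # Big TCP: raise the GSO/GRO size limits above the classic 64 KiB.
    echo "ip link set dev $dev gso_max_size $size gro_max_size $size"
    # Driver private flag named in the text above.
    echo "ethtool --set-priv-flags $dev enable-max-rx-buffer-size on"
}
```

Calling `tune_cmds eth1` prints the two commands so they can be reviewed before being run as root. `ethtool --show-priv-flags eth1` lists the private flags the driver actually supports, which is worth checking first.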

Should we explain, at least briefly, what these flags do and why we are setting these values? Especially the private flags.

aojea added 2 commits June 23, 2025 21:32
Change-Id: Ic2cbf3ce0a4f40932268050db7f1d3ff3053429f
Change-Id: I535e6deebb36a861c32ded91f33549815e7f0275
@aojea aojea merged commit 3eea1a6 into google:main Jun 23, 2025
7 checks passed
gcloud compute --project=${PROJECT?} \
networks subnets create \
tpu-net-2-sub \
--network=tpu-net-1 \

@aojea this should be tpu-net-2

@samos123

It would be helpful to have a before-and-after comparison: hostNetwork=true versus DraNet.

I suspect dranet perf is actually much better.
