Universal Translator

A TP-topology-flexibility framework for vLLM. Lets models load on tensor-parallelism configurations they weren't designed for, with measurable, bounded cost.

Status

Phase 1 pilot complete. RFC v2 drafted post-pilot, post-cross-LLM-review. Filed at vllm-project/vllm as an [RFC] PR.

RFC v2: rfc-v2.md — the proposal (post-pilot, 678 lines)
Pilot results: pilot-results.md — measured throughput and correctness on Ling-2.6-flash at TP=8 on 8× A4000 PCIe
Cross-LLM review methodology: cross-llm-review-methodology.md — the panel-of-LLMs review pattern that shaped this work, with the Universal Translator as a worked example

TL;DR

vLLM models frequently hard-code TP constraints in their kernels — assertions like tp_size <= group_norm_size, num_kv_heads % tp_size == 0, or implicit assumptions about partial_rotary_factor == 1.0. These are typically defensive fast-paths, not mathematical requirements. The Universal Translator framework lets models declare such constraints and routes around them at runtime, converting hard rejections into measurable-cost choices.
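As a rough illustration of the declare-then-route idea, here is a minimal Python sketch. It is not the framework's actual API: TPConstraint, Outcome, resolve, the config values, and the pairing of each constraint with a mitigation label are all hypothetical and chosen only to show the shape of the approach.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    NATIVE = "native"        # constraint holds; the fast path can be used
    MITIGATED = "mitigated"  # constraint violated; a declared fallback exists
    REJECTED = "rejected"    # constraint violated; no fallback declared


@dataclass(frozen=True)
class TPConstraint:
    """Hypothetical declaration of one TP-shape constraint."""
    name: str
    holds: Callable[[dict, int], bool]  # (model_cfg, tp_size) -> bool
    mitigation: Optional[str] = None    # label of the fallback, if any


def resolve(constraints, model_cfg, tp_size):
    """Map each constraint name to (Outcome, mitigation label or None)."""
    plan = {}
    for c in constraints:
        if c.holds(model_cfg, tp_size):
            plan[c.name] = (Outcome.NATIVE, None)
        elif c.mitigation is not None:
            plan[c.name] = (Outcome.MITIGATED, c.mitigation)
        else:
            plan[c.name] = (Outcome.REJECTED, None)
    return plan


# Illustrative declarations. Leaving mitigation=None below only exercises the
# REJECTED branch; it is not a claim that no mitigation exists in practice.
CONSTRAINTS = [
    TPConstraint(
        name="tp_size <= group_norm_size",
        holds=lambda cfg, tp: tp <= cfg["group_norm_size"],
        mitigation="cross_rank_reduce",
    ),
    TPConstraint(
        name="num_kv_heads % tp_size == 0",
        holds=lambda cfg, tp: cfg["num_kv_heads"] % tp == 0,
    ),
]

# Made-up config values, not taken from any real model.
cfg = {"group_norm_size": 4, "num_kv_heads": 8}
plan = resolve(CONSTRAINTS, cfg, tp_size=8)
```

With these made-up values, the first constraint is violated at TP=8 and routed to its declared fallback, while the second holds and stays on the native path.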

The Phase 1 pilot validated the framework on Ling-2.6-flash at TP=8 on 8× A4000 PCIe: the cross-rank-reduce mitigation added a measured ~3.3% per-token overhead, below the low end of the original 5–15% estimate.

What was discovered along the way

The pilot's most interesting finding wasn't the framework itself. It was that the anti-pattern the framework targets (defensive constraints masquerading as hardware reality) recurs at every layer of the inference stack. The pilot peeled back five integration gates, nested four layers deep:

vLLM TP-shape assertion
lightning_attn SMEM ceiling on Ampere (CBLOCK=64 exceeds A4000 limit)
vLLM gate on FLASHINFER_MLA (major == 10, Blackwell only)
FlashInfer dispatch routing (Ampere → XQA, Blackwell-only)
fa2 kernel paged-indexing bug on Ampere bf16 (returns zeros for non-contiguous kv_indices)

Each one looked like hardware reality. None was. The framework's discipline — catalog the constraint, mitigate where math allows, surface a precise error otherwise — applies as a methodology, not just a fixed catalog of TP-shape patterns.
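The "surface a precise error otherwise" step could look something like the sketch below. The names UnsupportedTopologyError and check_or_explain are hypothetical and are not part of vLLM; the point is only that the error carries the constraint, the observed values, and an actionable hint instead of a bare assertion failure.

```python
class UnsupportedTopologyError(ValueError):
    """Hypothetical error type carrying the facts a user needs to act on."""

    def __init__(self, constraint, observed, hint):
        self.constraint = constraint
        self.observed = dict(observed)
        self.hint = hint
        details = ", ".join(f"{k}={v}" for k, v in sorted(self.observed.items()))
        super().__init__(
            f"constraint '{constraint}' not satisfied ({details}); {hint}"
        )


def check_or_explain(constraint, holds, observed, hint):
    """Raise a structured error when a cataloged constraint does not hold."""
    if not holds:
        raise UnsupportedTopologyError(constraint, observed, hint)


# Usage with made-up values: 4 KV heads cannot be split evenly over 8 ranks.
try:
    check_or_explain(
        "num_kv_heads % tp_size == 0",
        holds=(4 % 8 == 0),
        observed={"num_kv_heads": 4, "tp_size": 8},
        hint="choose a tp_size that divides num_kv_heads",
    )
    message = None
except UnsupportedTopologyError as err:
    message = str(err)
```

Keeping the observed values on the exception object, rather than only in the message string, lets a caller log or test them without parsing text.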

Separable upstream contributions

Three real upstream contributions surfaced from the pilot, each substantially generalizable beyond Ling:

vLLM PR: lightning_attn CBLOCK device-aware on Ampere — affects every Lightning-LA model on Ampere consumer hardware (MiniMax-Text01, MiniMax M2/M2.5/M2.7, Ling/Bailing, etc.)
FlashInfer issue: fa2 MLA paged-indexing bug — affects every user of BatchMLAPagedAttentionWrapper(backend="fa2") on non-Blackwell, in any inference engine
vLLM PR (deferred): FlashInfer-MLA fa2 adapter for Ampere/Hopper bf16 — depends on the upstream FlashInfer fix; opens MLA-class models (Ling, GLM-4.7, DeepSeek-V2/V3) to a new hardware tier
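To make "device-aware CBLOCK" concrete, one possible shape for the selection logic is sketched below. This is not the code in the vLLM PR: pick_cblock, the candidate sizes, and the toy shared-memory cost model are assumptions for illustration. The real kernel's shared-memory footprint would have to come from the kernel itself.

```python
def pick_cblock(smem_limit_bytes, smem_bytes_for, candidates=(64, 32, 16)):
    """Return the largest candidate block size whose estimated shared-memory
    footprint fits within the device limit.

    smem_bytes_for is a caller-supplied estimate: block size -> bytes.
    """
    for cblock in sorted(candidates, reverse=True):
        if smem_bytes_for(cblock) <= smem_limit_bytes:
            return cblock
    raise ValueError(
        f"no candidate in {tuple(candidates)} fits within "
        f"{smem_limit_bytes} bytes of shared memory"
    )


def toy_smem_bytes(cblock):
    """Toy linear cost model (hypothetical, not the real kernel's footprint)."""
    return cblock * 2048


# Made-up limits: a smaller device falls back to 32, a larger one keeps 64.
small_device_choice = pick_cblock(100_000, toy_smem_bytes)
large_device_choice = pick_cblock(200_000, toy_smem_bytes)
```

Choosing by a measured or estimated footprint, rather than by GPU model name, avoids adding one more hard-coded gate of the kind the pilot had to peel back.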

Links to the actual filed PRs/issues will be added below as they land.

Filed artifacts

vLLM RFC PR: vllm-project/vllm#41509
vLLM lightning_attn PR: vllm-project/vllm#41508
FlashInfer fa2 paged-indexing bug: flashinfer-ai/flashinfer#3219
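For readers unfamiliar with paged KV indexing, the sketch below shows what a correct gather over non-contiguous page indices should produce. It is a pure-Python reference with made-up data, not FlashInfer code or its API; gather_paged_kv and the toy cache layout are hypothetical. A reference like this is the kind of oracle a kernel's output can be compared against.

```python
def gather_paged_kv(pages, kv_indices, page_size, kv_len):
    """Reference gather: return the first kv_len entries in logical order.

    pages      -- physical pages, each a list of page_size entries
    kv_indices -- physical page ids for this sequence, in logical order
    """
    if kv_len > len(kv_indices) * page_size:
        raise ValueError("kv_len exceeds the capacity of the listed pages")
    out = []
    for logical_pos in range(kv_len):
        page_id = kv_indices[logical_pos // page_size]
        out.append(pages[page_id][logical_pos % page_size])
    return out


# Toy cache: 8 pages of 2 entries each; every entry is nonzero and encodes
# its physical location as 10 * page + slot + 1.
PAGE_SIZE = 2
pages = [[10 * p + s + 1 for s in range(PAGE_SIZE)] for p in range(8)]

contiguous = gather_paged_kv(pages, [0, 1, 2], PAGE_SIZE, kv_len=5)
scattered = gather_paged_kv(pages, [5, 2, 7], PAGE_SIZE, kv_len=5)
# contiguous -> [1, 2, 11, 12, 21]
# scattered  -> [51, 52, 21, 22, 71]
```

Both index lists must yield nonzero entries in the order the indices specify; an all-zero result for the scattered case would indicate the class of bug reported in the issue.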

Pilot artifacts

The pilot fork (vLLM 0.20.1.dev0 with the framework patch applied) and supporting artifacts (unit tests, bench harness, debug writeups) live in the local environment that produced this work. The RFC v2 references them by description; their content is reproducible from the patches in the vLLM PRs above.

License

CC0 for the documents in this repo. The vLLM-side patches inherit Apache 2.0 from upstream.
