MidasMining/universal-translator

Universal Translator

A tensor-parallelism (TP) topology-flexibility framework for vLLM. It lets models load on TP configurations they weren't designed for, at a measurable, bounded cost.

Status

Phase 1 pilot complete. RFC v2 drafted post-pilot, post-cross-LLM-review. Filed at vllm-project/vllm as an [RFC] PR.

  • RFC v2: rfc-v2.md — the proposal (post-pilot, 678 lines)
  • Pilot results: pilot-results.md — measured throughput and correctness on Ling-2.6-flash at TP=8 on 8× A4000 PCIe
  • Cross-LLM review methodology: cross-llm-review-methodology.md — the panel-of-LLMs review pattern that shaped this work, with the Universal Translator as a worked example

TL;DR

vLLM models frequently hard-code TP constraints in their kernels: assertions such as tp_size <= group_norm_size or num_kv_heads % tp_size == 0, or implicit assumptions such as partial_rotary_factor == 1.0. These are typically defensive fast-paths, not mathematical requirements. The Universal Translator framework lets models declare such constraints and routes around them at runtime, converting hard rejections into choices with a measurable cost.
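The declare-then-route idea can be illustrated with a minimal sketch. Everything named here (`TPConstraint`, `Action`, `resolve`, `CATALOG`, the `cross_rank_reduce` label, and the config values) is hypothetical and chosen for illustration; it is not the API proposed in the RFC or anything in vLLM.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    FAST_PATH = "fast_path"  # constraint holds: the original kernel path is fine
    MITIGATE = "mitigate"    # constraint violated, but a correct fallback is registered
    REJECT = "reject"        # constraint violated and no fallback is known


@dataclass(frozen=True)
class TPConstraint:
    """Hypothetical record a model could use to declare one TP constraint."""
    name: str
    holds: Callable[[dict, int], bool]  # (model_config, tp_size) -> bool
    mitigation: Optional[str] = None    # label of a fallback path, if any


def resolve(constraint: TPConstraint, config: dict, tp_size: int):
    """Turn a hard assertion into an explicit decision plus a precise message."""
    if constraint.holds(config, tp_size):
        return Action.FAST_PATH, ""
    if constraint.mitigation is not None:
        return Action.MITIGATE, (
            f"{constraint.name} violated at tp_size={tp_size}; "
            f"using {constraint.mitigation}"
        )
    return Action.REJECT, (
        f"{constraint.name} violated at tp_size={tp_size}; no mitigation registered"
    )


# Two of the constraint shapes named above, with an assumed mitigation for the first.
CATALOG = [
    TPConstraint(
        "tp_size <= group_norm_size",
        lambda c, tp: tp <= c["group_norm_size"],
        mitigation="cross_rank_reduce",
    ),
    TPConstraint(
        "num_kv_heads % tp_size == 0",
        lambda c, tp: c["num_kv_heads"] % tp == 0,
    ),
]

config = {"group_norm_size": 4, "num_kv_heads": 8}  # illustrative values only
for constraint in CATALOG:
    action, message = resolve(constraint, config, tp_size=8)
    print(action.value, message)
```

In this sketch a violated constraint with no registered mitigation still fails, but with a message that names the constraint and the TP size instead of a bare assertion.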

The Phase 1 pilot validated the framework on Ling-2.6-flash at TP=8 on 8× A4000 PCIe. It measured ~3.3% per-token overhead from the cross-rank-reduce mitigation, below the original 5–15% estimate.
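Why a cross-rank reduce can remove a constraint like tp_size <= group_norm_size is easiest to see numerically. The sketch below assumes an RMS-style group norm and simulates ranks as plain Python lists; the function names are hypothetical and the pilot's actual kernel may differ. When one normalization group is split across ranks, each rank contributes a partial sum of squares, and a sum all-reduce recovers the statistic the unsharded norm would have used.

```python
import math


def rms_norm(xs, eps=1e-6):
    """Unsharded reference: normalize one group by its root mean square."""
    mean_sq = sum(x * x for x in xs) / len(xs)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in xs]


def sharded_rms_norm(shards, eps=1e-6):
    """One group split across ranks; `shards[r]` is the slice held by rank r."""
    # Each rank computes a local partial sum of squares over its slice.
    partial_sq = [sum(x * x for x in shard) for shard in shards]
    # Stand-in for an all-reduce(sum) across the ranks that share the group.
    total_sq = sum(partial_sq)
    total_n = sum(len(shard) for shard in shards)
    scale = 1.0 / math.sqrt(total_sq / total_n + eps)
    # Each rank then scales only its own slice.
    return [[x * scale for x in shard] for shard in shards]


group = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 0.0, -2.0]
shards = [group[:4], group[4:]]  # the group straddles two ranks

reference = rms_norm(group)
sharded = [x for shard in sharded_rms_norm(shards) for x in shard]
matches = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
              for a, b in zip(reference, sharded))
print(matches)
```

The extra collective is the source of the added cost: the math is unchanged, but each normalization now waits on one small reduction across ranks.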

What was discovered along the way

The pilot's most interesting finding wasn't the framework itself. It was that the anti-pattern the framework targets (defensive constraints masquerading as hardware reality) recurs at every layer of the inference stack. Five integration gates were peeled back in one pilot, four nested layers deep:

  1. vLLM TP-shape assertion
  2. lightning_attn SMEM ceiling on Ampere (CBLOCK=64 exceeds A4000 limit)
  3. vLLM gate on FLASHINFER_MLA (requires compute capability major == 10, i.e. Blackwell only)
  4. FlashInfer dispatch routing (Ampere is routed to XQA, which is Blackwell-only)
  5. fa2 kernel paged-indexing bug on Ampere bf16 (returns zeros for non-contiguous kv_indices)

Each one looked like hardware reality. None was. The framework's discipline — catalog the constraint, mitigate where math allows, surface a precise error otherwise — applies as a methodology, not just a fixed catalog of TP-shape patterns.
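Gate 5 is the kind of defect a placement-invariance check can expose. The sketch below is a pure-Python reference, not FlashInfer code: `gather_kv` models a paged KV lookup through `kv_indices`, and `contiguous_only_gather` is a hypothetical faulty variant that assumes a request's pages sit next to each other in the pool. It is a stand-in for the class of bug, not a reproduction of the actual fa2 kernel behavior.

```python
def gather_kv(page_pool, kv_indices, last_page_len):
    """Reference gather: concatenate one request's pages in logical order."""
    tokens = []
    for i, page_id in enumerate(kv_indices):
        page = page_pool[page_id]
        is_last = i == len(kv_indices) - 1
        tokens.extend(page[:last_page_len] if is_last else page)
    return tokens


def contiguous_only_gather(page_pool, kv_indices, last_page_len):
    """Hypothetical faulty variant: ignores all but the first page index."""
    assumed = [kv_indices[0] + i for i in range(len(kv_indices))]
    return gather_kv(page_pool, assumed, last_page_len)


def is_placement_invariant(gather, logical_pages, last_page_len,
                           placements, pool_size):
    """The gathered tokens must not depend on where pages live physically."""
    page_size = len(logical_pages[0])
    expected = [t for page in logical_pages[:-1] for t in page]
    expected += logical_pages[-1][:last_page_len]
    for kv_indices in placements:
        # Unused pool slots hold zeros, so a wrong lookup reads zeros.
        pool = [[0.0] * page_size for _ in range(pool_size)]
        for logical, physical in zip(logical_pages, kv_indices):
            pool[physical] = list(logical)
        if gather(pool, kv_indices, last_page_len) != expected:
            return False
    return True


logical_pages = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
placements = [[0, 1, 2], [5, 2, 7]]  # contiguous, then non-contiguous

print(is_placement_invariant(gather_kv, logical_pages, 1, placements, 8))               # True
print(is_placement_invariant(contiguous_only_gather, logical_pages, 1, placements, 8))  # False
```

A test that only ever allocates pages contiguously would pass both functions, which is one way a defect of this shape can go unnoticed.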

Separable upstream contributions

Three real upstream contributions surfaced from the pilot, each substantially generalizable beyond Ling:

  • vLLM PR: lightning_attn CBLOCK device-aware on Ampere — affects every Lightning-LA model on Ampere consumer hardware (MiniMax-Text01, MiniMax M2/M2.5/M2.7, Ling/Bailing, etc.)
  • FlashInfer issue: fa2 MLA paged-indexing bug — affects every user of BatchMLAPagedAttentionWrapper(backend="fa2") on non-Blackwell, in any inference engine
  • vLLM PR (deferred): FlashInfer-MLA fa2 adapter for Ampere/Hopper bf16 — depends on the upstream FlashInfer fix; opens MLA-class models (Ling, GLM-4.7, DeepSeek-V2/V3) to a new hardware tier

Links to the actual filed PRs/issues will be added below as they land.

Filed artifacts

Pilot artifacts

The pilot fork (vLLM 0.20.1.dev0 with the framework patch applied) and supporting artifacts (unit tests, bench harness, debug writeups) live in the local environment that produced this work. The RFC v2 references them by description; their content is reproducible from the patches in the vLLM PRs above.

License

CC0 for the documents in this repo. The vLLM-side patches inherit Apache 2.0 from the upstream.
