## Summary
I am attempting to reproduce the InferenceMAX GPT-OSS-120B benchmarks on RunPod B200 but my vLLM results show a significant performance gap compared to SemiAnalysis benchmarks. I need clarification on the environment setup and configuration used.
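For context, a minimal sketch of the kind of single-request sanity check run before the serving sweep is below (the model path, prompt, and sampling settings are assumptions for illustration; the actual benchmark goes through vLLM's OpenAI-compatible server, as sketched under "Our Full Results"):

```python
# Hypothetical single-request sanity check, NOT the serving benchmark itself.
# Uses vLLM's offline Python API to confirm single-stream generation speed.
import time

from vllm import LLM, SamplingParams  # vLLM 0.13.0 in our case

llm = LLM(model="openai/gpt-oss-120b")  # assumed model path, single B200
params = SamplingParams(max_tokens=1024, temperature=0.0)

start = time.perf_counter()
out = llm.generate(["Explain KV caching in one paragraph."], params)[0]
elapsed = time.perf_counter() - start

output_tokens = len(out.outputs[0].token_ids)
print(f"{output_tokens / elapsed:.0f} output tok/s for a single request")
```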
## My Environment
| Component | Version |
|---|---|
| GPU | NVIDIA B200 (183GB VRAM, SM100) |
| Driver | 570.195.03 |
| CUDA | 12.8.93 |
| Platform | RunPod |
| vLLM | 0.13.0 |
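For reference, a small sketch of how the values above can be confirmed from inside the benchmark environment (assumes `torch` and the `nvidia-ml-py` package providing `pynvml` are installed; this is only a convenience check, not part of the benchmark):

```python
# Convenience check of the environment reported above (driver, CUDA, GPU).
import pynvml
import torch


def _s(x):  # older pynvml returns bytes, newer nvidia-ml-py returns str
    return x.decode() if isinstance(x, bytes) else x


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

driver = _s(pynvml.nvmlSystemGetDriverVersion())
gpu_name = _s(pynvml.nvmlDeviceGetName(handle))
vram_gib = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**3

print(f"Driver : {driver}")
print(f"GPU    : {gpu_name} ({vram_gib:.0f} GiB)")
print(f"CUDA   : {torch.version.cuda} (torch {torch.__version__})")
print(f"SM     : {'.'.join(map(str, torch.cuda.get_device_capability(0)))}")

pynvml.nvmlShutdown()
```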
## Performance Gap
Comparing at similar throughput levels shows a large latency gap:
| Source | Output Throughput | E2E Latency | Concurrency |
|---|---|---|---|
| SemiAnalysis vLLM | ~4,666 tok/s | ~10s | C=128 |
| Our vLLM | ~3,663 tok/s | ~27s | C=100 |
| Our vLLM | ~5,051 tok/s | ~40s | C=200 |
At comparable latency (~10s), SemiAnalysis achieves ~4,666 tok/s, while our setup would only reach around ~2,000 tok/s, roughly a 2x performance gap.
## Our Full Results
| Concurrency | Output Throughput | E2E Latency |
|---|---|---|
| C=1 | 215 tok/s | 4.6s |
| C=20 | 1,370 tok/s | 14.6s |
| C=50 | 2,427 tok/s | 20.6s |
| C=100 | 3,663 tok/s | 27.3s |
| C=200 | 5,051 tok/s | 39.6s |
| C=300 | 5,725 tok/s | 52.4s |
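The table was produced by sweeping request concurrency against the vLLM server. A minimal sketch of that style of client is below (the endpoint URL, model ID, prompt, and output length are assumptions for illustration; the actual InferenceMAX harness and our script use different request mixes):

```python
# Minimal sketch of a concurrency-sweep client (NOT the InferenceMAX harness).
# Reports output-token throughput and mean end-to-end latency per concurrency.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

BASE_URL = "http://localhost:8000/v1"  # assumed vLLM OpenAI-compatible endpoint
MODEL = "openai/gpt-oss-120b"          # assumed model ID
PROMPT = "Summarize the trade-off between latency and throughput."
MAX_TOKENS = 1024                      # assumed output length


async def one_request(client: AsyncOpenAI) -> tuple[int, float]:
    """Send one request; return (completion_tokens, e2e_latency_seconds)."""
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=MAX_TOKENS,
        temperature=1.0,
    )
    return resp.usage.completion_tokens, time.perf_counter() - start


async def sweep(concurrency: int, total_requests: int) -> None:
    client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def bounded() -> tuple[int, float]:
        async with sem:
            return await one_request(client)

    start = time.perf_counter()
    results = await asyncio.gather(*(bounded() for _ in range(total_requests)))
    wall = time.perf_counter() - start

    tokens = sum(t for t, _ in results)
    mean_latency = sum(lat for _, lat in results) / len(results)
    print(f"C={concurrency}: {tokens / wall:,.0f} tok/s output, "
          f"{mean_latency:.1f}s mean E2E latency")


if __name__ == "__main__":
    for c in (1, 20, 50, 100, 200, 300):
        asyncio.run(sweep(c, total_requests=4 * c))
```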
## Questions
- What driver version was used? RunPod B200 has driver 570.x (CUDA 12.8). Is driver 575+ (CUDA 13) required for optimal performance?
- What cloud platform was used? Different platforms may have different driver/software stacks.
- Is Docker required? The benchmark scripts reference Docker containers.
Any guidance on configuration or environment requirements would be appreciated. Thank you!