Suggestion from LMSYS SGLang: DeepSeek #138

LMSYS SGLang engineers are saying that the Pareto frontier is not optimal, because dp_attention and the other parallelism strategies are hard-set and the only thing that gets swept is FP4 on 4 GPUs vs. 8 GPUs.
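To make the concern concrete, here is a minimal sketch of how a Pareto frontier over benchmark results could be computed, and why adding more parallelism configurations can move it. The `Result` type, the two metric names, the labels, and all numbers are placeholders made up for illustration, not measured InferenceMAX data.

```python
from typing import NamedTuple


class Result(NamedTuple):
    label: str                 # hypothetical config name
    throughput_per_gpu: float  # higher is better (placeholder units)
    interactivity: float       # higher is better (placeholder units)


def pareto_frontier(results):
    """Return the results that no other result dominates on both metrics."""
    frontier = []
    for r in results:
        dominated = any(
            o.throughput_per_gpu >= r.throughput_per_gpu
            and o.interactivity >= r.interactivity
            and (o.throughput_per_gpu > r.throughput_per_gpu
                 or o.interactivity > r.interactivity)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Sort for stable, readable output.
    return sorted(frontier, key=lambda r: r.interactivity)


# Placeholder numbers only: a hard-set strategy swept over GPU count.
fixed = [Result("tp4", 120.0, 8.0), Result("tp8", 100.0, 10.0)]

# Same points plus hypothetical runs with other parallelism settings.
swept = fixed + [
    Result("tp8+dp_attention", 150.0, 9.0),
    Result("tp8+ep8", 90.0, 12.0),
]

fixed_frontier = pareto_frontier(fixed)
swept_frontier = pareto_frontier(swept)
```

With these placeholder values, `tp4` drops off the frontier once the extra configurations are included, which is the kind of effect the SGLang engineers are pointing at: a frontier built from a narrow sweep can understate what the hardware can do.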

This can probably only be implemented after https://github.com/InferenceMAX/InferenceMAX/pull/111 is merged.

#145 makes EP and DP attention sweeps a first-class feature in the YAML.
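As a rough illustration of what sweeping these dimensions means, the sketch below expands a sweep definition into individual benchmark configurations. The key names (`precision`, `tp`, `ep`, `dp_attention`), the values, and the validity rule are all assumptions for illustration; they are not the schema from #145.

```python
from itertools import product

# Hypothetical sweep definition, as it might look after loading a YAML file.
sweep = {
    "precision": ["fp4", "fp8"],
    "tp": [4, 8],
    "ep": [1, 8],
    "dp_attention": [False, True],
}


def expand(sweep):
    """Return one dict per combination of the swept values."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*sweep.values())]


def is_valid(cfg):
    # Assumed constraint for illustration: EP size cannot exceed the GPU count.
    return cfg["ep"] <= cfg["tp"]


all_configs = expand(sweep)
valid_configs = [cfg for cfg in all_configs if is_valid(cfg)]
```

Each entry in `valid_configs` would correspond to one benchmark run, so the frontier is built from the full grid instead of a single hard-set strategy.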

B200 FP4: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp4_b200_docker.sh#L20
B200 FP8: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp8_b200_docker.sh
H200 FP8: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp8_h200_slurm.sh
