LMSYS SGLang engineers are saying that the Pareto frontier is not optimal because dp_attention and the other parallelism strategies are hard-coded, and the only thing that gets swept is FP4 on 4 GPUs vs. 8 GPUs.
This can probably only be implemented after https://github.com/InferenceMAX/InferenceMAX/pull/111 is merged.
#145 makes EP and DP attention sweeps a first-class feature in the YAML.
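For illustration, here is a rough sketch of what sweeping the parallelism knobs (instead of hard-coding them) and then taking the Pareto frontier could look like. Everything in it is an assumption for the sake of the example: the axis names `tp`, `ep`, `dp_attention`, their values, the `ep <= tp` validity rule, and the helper functions are hypothetical and are not the InferenceMAX YAML schema or the #145 implementation.

```python
from itertools import product

# Hypothetical sweep axes; names and values are illustrative only and are
# NOT taken from the InferenceMAX YAML schema.
SWEEP_AXES = {
    "tp": [4, 8],
    "ep": [1, 4, 8],
    "dp_attention": [False, True],
}


def expand_sweep(axes):
    """Return one config dict per point in the cross product of the axes."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*(axes[n] for n in names))]


def is_valid(cfg):
    """Assumed constraint: the EP size may not exceed the TP size."""
    return cfg["ep"] <= cfg["tp"]


def pareto_frontier(points):
    """Keep the (x, y) points that no other point dominates (larger is better on both)."""
    frontier = []
    best_y = float("-inf")
    # Walk from the largest x down; a point survives only if it improves on y.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return frontier


# Every valid combination becomes one benchmark run.
configs = [c for c in expand_sweep(SWEEP_AXES) if is_valid(c)]
```

Each entry in `configs` would be benchmarked separately, and `pareto_frontier` would then be applied to the measured (interactivity, throughput per GPU) pairs, so the frontier reflects the best parallelism setting at each operating point rather than a single fixed one.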
B200 FP4: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp4_b200_docker.sh#L20
B200 FP8: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp8_b200_docker.sh
H200 FP8: https://github.com/InferenceMAX/InferenceMAX/blob/main/benchmarks/dsr1_fp8_h200_slurm.sh