[AMD/ROCM] atom qwen fp8/fp8_mtp3 on mi355x #1040
Changes from all commits: 77fb19b, 965fc96, a4e3555, 88ecd67, de0b445, 0d26bce, b989a64, 7be88d6
First new file (`@@ -0,0 +1,82 @@`): the fp8 serving benchmark script.

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE \
    DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
MEM_FRAC_STATIC=0.9

set -x

python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --gpu-memory-utilization $MEM_FRAC_STATIC \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
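For orientation, a hypothetical invocation of this script is sketched below. None of these values come from the PR: the model name, result filename, and script path are placeholders that simply satisfy the `check_env_vars` list at the top of the script.

```bash
# Illustrative invocation only: every value below is a placeholder, not from the PR.
export MODEL="Qwen/Qwen3-FP8-placeholder"   # hypothetical fp8 checkpoint
export TP=8                                 # tensor-parallel size
export CONC=64                              # benchmark concurrency
export ISL=1024                             # input sequence length
export OSL=1024                             # output sequence length
export RANDOM_RANGE_RATIO=0.8               # prompt-length jitter for random data
export RESULT_FILENAME="qwen_fp8_tp8_conc64.json"
export EP_SIZE=1                            # >1 adds --enable-expert-parallel
export DP_ATTENTION=false                   # only echoed by this script
RUN_EVAL=true bash ./benchmark_fp8_atom.sh  # placeholder path for the script above
```

Note that with ISL=1024 and OSL=1024 the script leaves `--max-model-len` unset; any other length combination pins it to 10240.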
Second new file (`@@ -0,0 +1,84 @@`): the fp8 MTP variant. It is identical to the script above except that the server launch adds `--method mtp --num-speculative-tokens 3`, enabling multi-token-prediction speculative decoding with three draft tokens (the "mtp3" of the PR title).

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE \
    DP_ATTENTION

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export OMP_NUM_THREADS=1

# Calculate max-model-len based on ISL and OSL
if [ "$ISL" = "1024" ] && [ "$OSL" = "1024" ]; then
    CALCULATED_MAX_MODEL_LEN=""
else
    CALCULATED_MAX_MODEL_LEN=" --max-model-len 10240 "
fi

if [ "$EP_SIZE" -gt 1 ]; then
    EP=" --enable-expert-parallel"
else
    EP=" "
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
MEM_FRAC_STATIC=0.9

set -x

python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --gpu-memory-utilization $MEM_FRAC_STATIC \
    --method mtp \
    --num-speculative-tokens 3 \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &
SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

export PYTHONDONTWRITEBYTECODE=1
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```

**Contributor** commented on lines +45 to +56:

> 🔴 The bf16 atom script (`qwen3.5_bf16_mi355x_atom.sh`) is a byte-for-byte copy of the fp8 script and incorrectly applies `--kv_cache_dtype fp8` (line 49) to a native BF16 model, producing non-representative BF16 benchmark results that are actually fp8-KV-quantized runs. Additionally, the bf16 atom script has no corresponding entry in `.github/configs/amd-master.yaml` (only fp8 and fp4 atom YAML entries were added), so the benchmark pipeline cannot invoke it at all. Fix both issues before merging.
>
> **Extended reasoning**
>
> *Issue 1: `--kv_cache_dtype fp8` in the bf16 atom script.* The two scripts `qwen3.5_bf16_mi355x_atom.sh` and `qwen3.5_fp8_mi355x_atom.sh` are byte-for-byte identical (verified by diff in the PR). Both pass `--kv_cache_dtype fp8` to the server.
>
> *Addressing the refutation (intentional pattern).* One verifier argued this is intentional because `dsr1_fp8_mi355x_atom.sh` and `dsr1_fp4_mi355x_atom.sh` are identical and both use `--kv_cache_dtype fp8`, with precision differentiation done via the YAML `model` field. This is partially correct for the DSR1 case, but crucially, there is a `qwen3.5-bf16-mi355x-sglang` config using model `Qwen/Qwen3.5-397B-A17B`, and the corresponding sglang script (`qwen3.5_bf16_mi355x.sh`) does NOT use `--kv_cache_dtype fp8`. Every other BF16 benchmark script in this codebase for this model follows the same pattern of omitting KV quantization. If the intent for atom-framework BF16 benchmarking is to also compress the KV cache, that should be an explicit and documented decision, not an accidental copy-paste from the fp8 variant.
>
> *Issue 2: missing YAML config entry for `qwen3.5-bf16-mi355x-atom`.* The PR adds `qwen3.5-fp8-mi355x-atom` and `qwen3.5-fp4-mi355x-atom` entries to `.github/configs/amd-master.yaml`, but no `qwen3.5-bf16-mi355x-atom` entry. The benchmark pipeline discovers which benchmarks to run from this YAML; without an entry, `qwen3.5_bf16_mi355x_atom.sh` is an orphaned script that can never be triggered. This is confirmed by grep returning no results for `qwen3.5-bf16-mi355x-atom` anywhere in the YAML. The PR title explicitly says "fp8/bf16 on mi355x", so bf16 atom was clearly intended to be integrated.
>
> *Concrete proof of the dual problem.*
> 1. The YAML entries `qwen3.5-fp8-mi355x-atom` and `qwen3.5-fp4-mi355x-atom` both reference their respective model checkpoints (FP8 and MXFP4 variants) and have corresponding benchmark scripts.
> 2. A matching `qwen3.5-bf16-mi355x-atom` YAML entry with `model: Qwen/Qwen3.5-397B-A17B` is absent.
> 3. Even if a bf16 atom YAML entry were added now and pointed the pipeline at `qwen3.5_bf16_mi355x_atom.sh`, the script would still launch the server with `--kv_cache_dtype fp8`, meaning the resulting numbers would reflect BF16 model weights plus an FP8 KV cache, not a clean BF16 baseline.
> 4. Compare to `qwen3.5-bf16-mi355x-sglang`: that config uses `Qwen/Qwen3.5-397B-A17B` with no `--kv_cache_dtype` flag, which is the expected BF16 baseline behavior.
>
> *How to fix.* Remove the `--kv_cache_dtype fp8` flag from `qwen3.5_bf16_mi355x_atom.sh` so it benchmarks a clean BF16 baseline, and add a `qwen3.5-bf16-mi355x-atom` entry to `.github/configs/amd-master.yaml` so the pipeline can actually invoke the script.
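To make the reviewer's first fix concrete, here is a minimal sketch of the corrected BF16 server launch. It assumes the rest of the bf16 script stays exactly as the scripts shown above and simply drops the KV-cache quantization flag:

```bash
# Sketch of the reviewer's proposed fix: the same launch as above, minus
# --kv_cache_dtype fp8, so the KV cache stays in the model's native BF16.
python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    $CALCULATED_MAX_MODEL_LEN $EP \
    --gpu-memory-utilization $MEM_FRAC_STATIC \
    --trust-remote-code \
    > $SERVER_LOG 2>&1 &
```

For the second fix, the review comment names the missing entry and its model value; the sketch below registers it. The real schema of `amd-master.yaml` is not visible in this diff, so the indentation and any additional required fields are assumptions:

```bash
# Hypothetical: append the missing bf16 atom entry to the pipeline config.
# Only the entry name and model value come from the review comment; any other
# fields the real schema requires would have to mirror the fp8/fp4 entries.
cat >> .github/configs/amd-master.yaml <<'EOF'
qwen3.5-bf16-mi355x-atom:
  model: Qwen/Qwen3.5-397B-A17B
EOF
```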