Running vidur with "backend=aicb" makes the attention timing path read the missing attn_kv_cache_save_* fields and raise a KeyError #266

## Problem description

When the simulator runs with `backend=aicb`, the attention prediction path generates only the following fields:

- `attn_prefill`
- `attn_decode`

However, the later execution-time calculation still tries to read:

- `attn_kv_cache_save_prefill`
- `attn_kv_cache_save_decode`

As a result, the program crashes at runtime with:

```python
KeyError: 'attn_kv_cache_save_prefill'
```

## Root cause analysis

With `backend=aicb`:

- Prediction goes through `_predict_from_models_by_aicb(...)`.
- Attention predictions come from `_predict_for_attention_layer_models_by_aicb(...)`.
- `_predict_for_attention_layer_models_by_aicb(...)` writes only `attn_prefill` and `attn_decode`; it never generates `attn_kv_cache_save_prefill` or `attn_kv_cache_save_decode`.
- `_get_attention_kv_cache_save_execution_time(...)` later assumes these fields always exist and reads them directly, which raises the `KeyError`.
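
The mismatch can be reproduced in isolation. The sketch below is not Vidur's code: the function names `predict_attention_aicb` and `get_kv_cache_save_time`, the dict-of-dicts layout, and the latency values are hypothetical stand-ins for the producer and consumer described above.

```python
# Minimal stand-in for the producer/consumer mismatch (hypothetical names and
# data layout; not the actual Vidur implementation).

def predict_attention_aicb() -> dict:
    """Mimics the AICB attention path: only two fields are written."""
    return {
        "attn_prefill": {(2048,): 1.0},  # placeholder latency values
        "attn_decode": {(1,): 0.1},
    }


def get_kv_cache_save_time(predictions: dict, num_tokens: int) -> float:
    """Mimics the consumer: assumes the kv-cache-save field always exists."""
    return predictions["attn_kv_cache_save_prefill"][(num_tokens,)]


predictions = predict_attention_aicb()
try:
    get_kv_cache_save_time(predictions, 2048)
    error = None
except KeyError as exc:
    # The lookup fails on the field name itself, before any feature key is used.
    error = exc

print(repr(error))
```

The lookup fails on the outer key, so the error names the missing field rather than a missing feature tuple, matching the message reported in this issue.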

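## Expected behavior

With `backend=aicb`, the simulator should not crash because of missing fields.
A reasonable fix is for the AICB prediction logic to also generate `attn_kv_cache_save_prefill` and `attn_kv_cache_save_decode`.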
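
One possible shape for such a fix, sketched under assumptions: after the AICB attention prediction runs, any missing `attn_kv_cache_save_*` field is filled in. The helper name `ensure_kv_cache_save_fields`, the dict layout, and the zero-cost default are all hypothetical. Whether zero is the right value depends on whether AICB's attention timings already include the KV-cache write, which this issue does not establish.

```python
from typing import Dict, Tuple

Predictions = Dict[str, Dict[Tuple[int, ...], float]]

# Fields the execution-time code reads, paired with the AICB field whose
# feature keys they mirror (pairing is an assumption for illustration).
KV_CACHE_SAVE_FIELDS = {
    "attn_kv_cache_save_prefill": "attn_prefill",
    "attn_kv_cache_save_decode": "attn_decode",
}


def ensure_kv_cache_save_fields(
    predictions: Predictions, default: float = 0.0
) -> Predictions:
    """Return a copy of `predictions` with missing kv-cache-save fields added."""
    patched = dict(predictions)
    for missing_field, source_field in KV_CACHE_SAVE_FIELDS.items():
        if missing_field not in patched:
            source = patched.get(source_field, {})
            # Reuse the source field's feature keys so later lookups succeed.
            patched[missing_field] = {key: default for key in source}
    return patched


aicb_predictions = {"attn_prefill": {(2048,): 1.0}, "attn_decode": {(1,): 0.1}}
patched = ensure_kv_cache_save_fields(aicb_predictions)
```

The helper returns a new top-level dict and leaves fields that already exist untouched, so it would be a no-op for backends that already produce the kv-cache-save predictions.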

## Actual behavior

The program crashes with `KeyError: 'attn_kv_cache_save_prefill'`.

## Impact

The `backend=aicb` path cannot get through the attention execution-time calculation.
Any run that reaches this logic can fail outright because the fields are missing.
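
Because the failure only surfaces once the attention timing code is reached, a fail-fast check right after prediction would report it earlier and more clearly. This is a sketch, not an existing Vidur API: `REQUIRED_ATTENTION_FIELDS`, `missing_attention_fields`, and `validate_predictions` are hypothetical names, and the field list covers only the four fields named in this issue.

```python
from typing import Dict, List

# Only the fields mentioned in this issue; a real check would likely need more.
REQUIRED_ATTENTION_FIELDS = (
    "attn_prefill",
    "attn_decode",
    "attn_kv_cache_save_prefill",
    "attn_kv_cache_save_decode",
)


def missing_attention_fields(predictions: Dict[str, object]) -> List[str]:
    """List required attention fields that are absent from `predictions`."""
    return [name for name in REQUIRED_ATTENTION_FIELDS if name not in predictions]


def validate_predictions(predictions: Dict[str, object], backend: str) -> None:
    """Raise a descriptive error up front instead of a late KeyError."""
    missing = missing_attention_fields(predictions)
    if missing:
        raise ValueError(
            f"backend={backend}: predictions are missing fields {missing}"
        )


missing = missing_attention_fields({"attn_prefill": {}, "attn_decode": {}})
```

Calling `validate_predictions(predictions, "aicb")` on the two-field output described above would raise a `ValueError` that lists both missing fields at once.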

## Reproduction commands

```shell
# 1) Run AICB with AIOB enabled to generate the inference workload profile that Vidur consumes
cd ../aicb
python -m workload_generator.Vidur_workload_generator DeepSeek-671B ./scripts/inference_configs/deepseek_default.json \
  --phase decode \
  --seq_length 2048 \
  --micro_batch 1 \
  --world_size 64 \
  --tensor_model_parallel_size 4 \
  --pipeline_model_parallel 1 \
  --expert_model_parallel_size 1

# 2) Run Vidur with the execution-time backend switched to AICB
cd ../vidur-alibabacloud
python -m vidur.main \
  --replica_config_pd_p2p_comm_bandwidth 800 \
  --replica_config_nvlink_bandwidth 1600 \
  --replica_config_rdma_bandwidth 800 \
  --replica_config_pd_p2p_comm_dtype float32 \
  --poisson_request_interval_generator_config_qps 100 \
  --synthetic_request_generator_config_num_requests 1 \
  --length_generator_config_type trace \
  --trace_request_length_generator_config_max_tokens 2048 \
  --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
  --interval_generator_config_type poisson \
  --cluster_config_num_replicas 16 \
  --replica_config_pd_node_ratio 0.5 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --split_wise_replica_scheduler_config_max_tokens_in_batch 400 \
  --replica_config_model_name deepseek-671B \
  --replica_config_tensor_parallel_size 4 \
  --replica_config_num_pipeline_stages 1 \
  --replica_config_expert_model_parallel_size 1 \
  --random_forrest_execution_time_predictor_config_backend aicb \
  --random_forrest_execution_time_predictor_config_simai_dir ../ \
  --random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \
  --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf
```
