Resolve the discrepancy in latency reporting between LLMs and non-LLMs

### 🐛 Describe the bug

![Image](https://github.com/user-attachments/assets/d3ef85bf-c057-490a-ab1d-96f5d691817a)

As shown on the dashboard, `avg_inference_latency (ms)` is skipped for LLMs, and only `generate_time (ms)` is reported instead.

Taking the iOS run as an example, an LLM job runs three on-device tests to report different metrics:
  1. `test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4`
  2. `test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4`
  3. `test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4`
A non-LLM job, by contrast, runs only the first two tests (`test_load_*` and `test_forward_*`).
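To make the difference concrete, here is a minimal sketch that groups test names by their phase prefix and compares what each job type runs. The helper names (`phase_of`, `phases_run`) and the non-LLM test names are hypothetical placeholders for illustration; only the three LLM test names come from the run above.

```python
import re

# Matches the phase prefix seen in the test names above (load / forward / generate).
PHASE_PATTERN = re.compile(r"^test_(load|forward|generate)_")


def phase_of(test_name):
    """Return the phase encoded in a test name, or None if it has no known prefix."""
    match = PHASE_PATTERN.match(test_name)
    return match.group(1) if match else None


def phases_run(test_names):
    """Collect the set of phases covered by a job's test names."""
    return {phase for phase in map(phase_of, test_names) if phase is not None}


# Test names taken from the LLM job in this issue.
llm_tests = [
    "test_load_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4",
    "test_forward_llama_3_2_1b_llama3_fb16_pte_iOS_17_2_1_iPhone15_4",
    "test_generate_llama_3_2_1b_llama3_fb16_pte_tokenizer_model_iOS_17_2_1_iPhone15_4",
]

# Hypothetical non-LLM test names; the real ones are in the linked job logs.
non_llm_tests = [
    "test_load_example_model_pte",
    "test_forward_example_model_pte",
]

shared_phases = phases_run(llm_tests) & phases_run(non_llm_tests)
llm_only_phases = phases_run(llm_tests) - phases_run(non_llm_tests)

print("shared:", sorted(shared_phases))
print("LLM only:", sorted(llm_only_phases))
```

Under these assumptions, `forward` is in the shared set, so a forward-based latency metric should be available for both job types.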

See detailed jobs here: 
  - LLM: https://github.com/pytorch/executorch/actions/runs/13403521306/job/37441009799
  - non-LLM: https://github.com/pytorch/executorch/actions/runs/13403521306/job/37441008720

**Three things to clarify in this task:**
  1. Since `test_forward_*` runs for both LLM and non-LLM jobs, why isn't its result reported on the dashboard for LLMs?
  2. Let's annotate each metric in the DB so users know exactly what each one measures.
  ~~3. Confirm if Android is measuring and reporting exact same metrics~~ #8578 

### Versions

trunk

cc @huydhn @kirklandsign @shoumikhin @mergennachin @byjlw
