
[BUG] Inaccurate FLOPs Calculation for Causal and Specialized Attention #14376

@NuojCheng

Description

This issue highlights inaccuracies in the FLOPs calculation for decoder-based models within nemo/utils/flops_formulas.py. Correcting these formulas is crucial for accurate model comparison and resource planning.

Getting these formulas right has been a recent focus in other major frameworks. For instance, Megatron-LM has updated its FLOPs calculations, and Google's MaxText has also refined its formulas to improve accuracy (see PRs #1988 and #2030).

The problems in NeMo are twofold:

  1. Standard causal attention isn't consistently accounted for, leading to a 2x overestimation of attention FLOPs for models like Llama and Mixtral.
  2. Models with specialized attention mechanisms require unique formulas, which are not currently implemented.

1. Causal Mask Inconsistency

FLOPs formulas should be consistent for all decoder models using standard causal attention.

Correct Implementation: The formulas for the base Transformer and DeepSeekV2 properly account for their respective attention mechanisms. The base Transformer formula correctly divides the FLOPs by two for the causal mask.

Missing Correction: This ÷2 adjustment for causal attention is absent in the formulas for other prominent models, notably Llama 2, Llama 3, and Mixtral. This causes their calculated attention FLOPs to be double what they should be.
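For reference, here is a minimal sketch of the attention-score FLOPs per layer with and without that correction; the function name and arguments are illustrative, not NeMo's actual signatures:

```python
def attention_score_flops(batch_size: int, seq_len: int, hidden_size: int, causal: bool = True) -> float:
    """Forward-pass FLOPs for the two attention matmuls (Q @ K^T and scores @ V).

    Each matmul performs batch * seq^2 * hidden multiply-accumulates, i.e.
    2 * batch * seq^2 * hidden FLOPs; together they give the factor of 4.
    A causal mask only needs the lower triangle of the score matrix, which
    removes roughly half of that work (the /2 correction discussed above).
    """
    flops = 4 * batch_size * seq_len**2 * hidden_size
    if causal:
        flops /= 2
    return flops
```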


2. Inaccuracy for Specialized Architectures

For new architectures, inheriting a generic formula can lead to errors. For example, flops_callback.py dispatches formulas by model type.

If a new model like "Llama4" uses chunked attention, simply applying the Llama 3 formula with a causal correction would still be incorrect: chunked attention limits each query to keys within its own chunk, so its computational cost scales differently and needs its own specific formula.
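To illustrate, here is a hedged sketch of how chunking changes the score-matmul cost; the chunk semantics and names are assumptions for illustration, not taken from NeMo or the Llama 4 implementation:

```python
def chunked_attention_score_flops(batch_size: int, seq_len: int, hidden_size: int, chunk_size: int) -> float:
    """Forward-pass FLOPs for Q @ K^T and scores @ V when each query only
    attends causally within its own chunk.

    The full causal score matrix has ~seq^2 / 2 entries; with chunking it is
    (seq / chunk) blocks of ~chunk^2 / 2 entries each, i.e. ~seq * chunk / 2,
    so the cost grows linearly in seq_len instead of quadratically.
    """
    num_chunks = seq_len // chunk_size  # assumes seq_len is a multiple of chunk_size
    per_chunk = 4 * batch_size * chunk_size**2 * hidden_size / 2  # causal within the chunk
    return num_chunks * per_chunk  # == 2 * batch * seq_len * chunk_size * hidden
```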

Proposed Solution

  1. Standardize Causal Attention: Apply the ÷2 adjustment to the attention FLOPs calculation for all relevant decoder models (Llama, Mixtral, etc.) to align with standard practice and ensure consistency.
  2. Implement Architectural Specificity: For models that do not use a full causal attention mask, such as Llama4, Gemma2, and Gemma3, which employ chunked or local attention, define a custom FLOPs formula that accurately reflects their reduced attention computation (see the sketch after this list).
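
A rough sketch of what point 2 could look like; the config attributes and function name below are hypothetical and only meant to show the shape of a per-architecture formula, not NeMo's actual API:

```python
def llama4_attention_flops(config, batch_size: int, seq_len: int) -> float:
    """Hypothetical attention-only FLOPs formula for a model that mixes
    full causal layers with chunked-attention layers.

    `config.chunked_attention_pattern` is an assumed list of booleans, one per
    layer, marking which layers use chunked attention; `attention_chunk_size`
    and `hidden_size` are likewise assumed attributes.
    """
    h = config.hidden_size
    flops = 0.0
    for uses_chunking in config.chunked_attention_pattern:
        if uses_chunking:
            # chunked causal attention: cost ~ seq * chunk instead of seq^2
            flops += 2 * batch_size * seq_len * config.attention_chunk_size * h
        else:
            # full causal attention: seq^2 term with the /2 correction applied
            flops += 2 * batch_size * seq_len**2 * h
    return flops  # MLP, embedding, and projection terms omitted for brevity
```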

Thank you for your consideration.
