
[Dev] Refactor and Improve MoE Logging#2569

Open
yanring wants to merge 10 commits into NVIDIA:dev from yanring:zijiey/add_imbalance_logging

Conversation


@yanring yanring commented Dec 5, 2025

What does this PR do ?

Main PR: #3431

PR Design / Implementation Doc

TL;DR

This PR refactors MoE metric logging from ad-hoc global state into a single structured tracker, making metric collection, distributed reduction, aggregation, and logging easier to reason about and safer to extend.


Problem Statement

Before this change, MoE logging logic was split across utility functions and mutable global dictionaries, which made behavior hard to follow and easy to break when adding new metrics or reduction rules.


Goals

  • Centralize MoE metric logging lifecycle in one component.
  • Make distributed reduction semantics explicit and configurable per metric.
  • Keep training-path call sites simple.
  • Preserve legacy entry points where practical during migration.

Non-Goals

  • Change MoE loss formulas.
  • Change optimizer/training semantics outside logging.
  • Introduce new user-facing CLI flags.

High-Level Design

A new singleton tracker MoEMetricsTracker in megatron/core/transformer/moe/moe_logging.py manages all MoE metric handling end-to-end:

  1. record(...) stores per-layer metric values.
  2. _reduce_across_ranks(...) performs distributed reductions.
  3. _aggregate_metrics(...) computes scalar summaries.
  4. write(...) emits scalars to TensorBoard / W&B.
  5. get_log_string(...) formats console output.
  6. clear() resets metric buffers each logging cycle.
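The lifecycle above can be sketched as a minimal, single-process Python class. This is an illustrative mock, not the actual Megatron-LM implementation: the real tracker also performs distributed reductions and writes to TensorBoard/W&B, and the aggregation rule here (mean over layers) is an assumption.

```python
from collections import defaultdict


class MoEMetricsTracker:
    """Single-process sketch of the tracker lifecycle (illustrative only;
    the real class also handles distributed reduction and TB/W&B output)."""

    _instance = None

    @classmethod
    def get_instance(cls):
        # Lazily-created process-wide singleton.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        # metric name -> list of (layer_idx, value) pairs
        self._metrics = defaultdict(list)

    def record(self, name, layer_idx, value):
        # Step 1: store per-layer metric values.
        self._metrics[name].append((layer_idx, value))

    def _aggregate_metrics(self):
        # Step 3: compute scalar summaries (mean over layers, an assumption).
        return {
            name: sum(v for _, v in vals) / len(vals)
            for name, vals in self._metrics.items()
        }

    def get_log_string(self):
        # Step 5: format console output.
        return " | ".join(
            f"{k}: {v:.4f}" for k, v in sorted(self._aggregate_metrics().items())
        )

    def clear(self):
        # Step 6: reset buffers each logging cycle.
        self._metrics.clear()
```

A training loop would call `record(...)` per layer, read `get_log_string()` at each logging interval, then `clear()` before the next cycle.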

Per-metric metadata is stored in _MetricEntry:

  • values
  • reduce_group
  • avg_group
  • needs_dp_avg
  • layer_percentiles

Key Implementation Changes

  • New module: megatron/core/transformer/moe/moe_logging.py
  • Router integration: megatron/core/transformer/moe/router.py
    • Directly calls MoEMetricsTracker.get_instance().record(...)
    • Uses needs_dp_avg=False for global_load_balancing_loss
  • Training integration: megatron/training/training.py
    • Uses tracker track(...) and appends returned MoE log string to stdout log
  • CUDA graph cleanup path: megatron/core/transformer/cuda_graphs.py
    • Clears tracker via MoEMetricsTracker.get_instance().clear()
  • Compatibility wrappers: megatron/core/transformer/moe/moe_utils.py
    • save_to_aux_losses_tracker(...) forwards into tracker
    • track_moe_metrics(...) remains deprecated wrapper

Reduction Semantics (Important)

For each metric, tracker reduction order is:

  1. PP all-reduce
  2. Optional reduce_group all-reduce
  3. Optional avg_group AVG all-reduce
  4. Optional DP AVG all-reduce controlled by needs_dp_avg

This replaces implicit/fragile logic and makes DP averaging intent explicit per metric.
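The per-metric reduction order can be made concrete with a small planner sketch. This is a single-process illustration, not the actual implementation: the group names are placeholders, and in Megatron each step would be a `torch.distributed.all_reduce` over the corresponding process group.

```python
def reduction_plan(entry):
    """Return the ordered list of collective ops the tracker would issue
    for one metric. Group names are illustrative placeholders."""
    # 1. PP all-reduce (always performed).
    plan = [("all_reduce", "pp_group", "SUM")]
    # 2. Optional reduce_group all-reduce.
    if entry.get("reduce_group") is not None:
        plan.append(("all_reduce", entry["reduce_group"], "SUM"))
    # 3. Optional avg_group AVG all-reduce.
    if entry.get("avg_group") is not None:
        plan.append(("all_reduce", entry["avg_group"], "AVG"))
    # 4. Optional DP AVG all-reduce, controlled by needs_dp_avg.
    if entry.get("needs_dp_avg", True):
        plan.append(("all_reduce", "dp_group", "AVG"))
    return plan
```

For example, a metric configured like `global_load_balancing_loss` with `needs_dp_avg=False` and no extra groups reduces only over PP, while a fully-configured metric issues all four collectives in order.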

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
The Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@yanring yanring requested review from a team as code owners December 5, 2025 14:27
@copy-pr-bot

copy-pr-bot bot commented Dec 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yanring yanring marked this pull request as draft December 5, 2025 14:27

yanring commented Dec 5, 2025

/ok to test 9a8e019

@yanring yanring force-pushed the zijiey/add_imbalance_logging branch from 9a8e019 to 71b3d1f Compare January 12, 2026 08:00
@yanring yanring changed the title [Dev] Improve MoE Logging [Dev] Refactor and Improve MoE Logging Jan 12, 2026
@yanring yanring self-assigned this Feb 14, 2026
@yanring yanring marked this pull request as ready for review February 14, 2026 17:11
@yanring yanring requested review from a team as code owners February 14, 2026 17:11

yanring commented Feb 14, 2026

/ok to test 1a18019

@yanring yanring added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Feb 24, 2026