
[Dev] Refactor and Improve MoE Logging#2569

Open
yanring wants to merge 10 commits into NVIDIA:dev from yanring:zijiey/add_imbalance_logging

Conversation


@yanring yanring commented Dec 5, 2025

What does this PR do ?

Main PR: #3431

PR Design / Implementation Doc

TL;DR

This PR refactors MoE metric logging from ad-hoc global state into a single structured tracker, making metric collection, distributed reduction, aggregation, and logging easier to reason about and safer to extend.


Problem Statement

Before this change, MoE logging logic was split across utility functions and mutable global dictionaries, which made behavior hard to follow and easy to break when adding new metrics or reduction rules.


Goals

  • Centralize MoE metric logging lifecycle in one component.
  • Make distributed reduction semantics explicit and configurable per metric.
  • Keep training-path call sites simple.
  • Preserve legacy entry points where practical during migration.

Non-Goals

  • Change MoE loss formulas.
  • Change optimizer/training semantics outside logging.
  • Introduce new user-facing CLI flags.

High-Level Design

A new singleton tracker MoEMetricsTracker in megatron/core/transformer/moe/moe_logging.py manages all MoE metric handling end-to-end:

  1. record(...) stores per-layer metric values.
  2. _reduce_across_ranks(...) performs distributed reductions.
  3. _aggregate_metrics(...) computes scalar summaries.
  4. write(...) emits scalars to TensorBoard / W&B.
  5. get_log_string(...) formats console output.
  6. clear() resets metric buffers each logging cycle.
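The lifecycle above can be sketched as a minimal, single-process Python class. This is an illustrative mock, not the actual Megatron-LM implementation: the real tracker also performs distributed reductions and writes to TensorBoard/W&B, and the aggregation rule here (mean over layers) is an assumption.

```python
from collections import defaultdict


class MoEMetricsTracker:
    """Single-process sketch of the tracker lifecycle (illustrative only;
    the real class also handles distributed reduction and TB/W&B output)."""

    _instance = None

    @classmethod
    def get_instance(cls):
        # Lazily-created process-wide singleton.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        # metric name -> list of (layer_idx, value) pairs
        self._metrics = defaultdict(list)

    def record(self, name, layer_idx, value):
        # Step 1: store per-layer metric values.
        self._metrics[name].append((layer_idx, value))

    def _aggregate_metrics(self):
        # Step 3: compute scalar summaries (mean over layers, an assumption).
        return {
            name: sum(v for _, v in vals) / len(vals)
            for name, vals in self._metrics.items()
        }

    def get_log_string(self):
        # Step 5: format console output.
        return " | ".join(
            f"{k}: {v:.4f}" for k, v in sorted(self._aggregate_metrics().items())
        )

    def clear(self):
        # Step 6: reset buffers each logging cycle.
        self._metrics.clear()
```

A training loop would call `record(...)` per layer, read `get_log_string()` at each logging interval, then `clear()` before the next cycle.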

Per-metric metadata is stored in _MetricEntry:

  • values
  • reduce_group
  • avg_group
  • needs_dp_avg
  • layer_percentiles

Key Implementation Changes

  • New module: megatron/core/transformer/moe/moe_logging.py
  • Router integration: megatron/core/transformer/moe/router.py
    • Directly calls MoEMetricsTracker.get_instance().record(...)
    • Uses needs_dp_avg=False for global_load_balancing_loss
  • Training integration: megatron/training/training.py
    • Uses tracker track(...) and appends returned MoE log string to stdout log
  • CUDA graph cleanup path: megatron/core/transformer/cuda_graphs.py
    • Clears tracker via MoEMetricsTracker.get_instance().clear()
  • Compatibility wrappers: megatron/core/transformer/moe/moe_utils.py
    • save_to_aux_losses_tracker(...) forwards into tracker
    • track_moe_metrics(...) remains deprecated wrapper

Reduction Semantics (Important)

For each metric, tracker reduction order is:

  1. PP all-reduce
  2. Optional reduce_group all-reduce
  3. Optional avg_group AVG all-reduce
  4. Optional DP AVG all-reduce controlled by needs_dp_avg

This replaces implicit/fragile logic and makes DP averaging intent explicit per metric.
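The per-metric reduction order can be made concrete with a small planner sketch. This is a single-process illustration, not the actual implementation: the group names are placeholders, and in Megatron each step would be a `torch.distributed.all_reduce` over the corresponding process group.

```python
def reduction_plan(entry):
    """Return the ordered list of collective ops the tracker would issue
    for one metric. Group names are illustrative placeholders."""
    # 1. PP all-reduce (always performed).
    plan = [("all_reduce", "pp_group", "SUM")]
    # 2. Optional reduce_group all-reduce.
    if entry.get("reduce_group") is not None:
        plan.append(("all_reduce", entry["reduce_group"], "SUM"))
    # 3. Optional avg_group AVG all-reduce.
    if entry.get("avg_group") is not None:
        plan.append(("all_reduce", entry["avg_group"], "AVG"))
    # 4. Optional DP AVG all-reduce, controlled by needs_dp_avg.
    if entry.get("needs_dp_avg", True):
        plan.append(("all_reduce", "dp_group", "AVG"))
    return plan
```

For example, a metric configured like `global_load_balancing_loss` with `needs_dp_avg=False` and no extra groups reduces only over PP, while a fully-configured metric issues all four collectives in order.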

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
The Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@yanring yanring requested review from a team as code owners December 5, 2025 14:27
@copy-pr-bot

copy-pr-bot bot commented Dec 5, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yanring yanring marked this pull request as draft December 5, 2025 14:27

yanring commented Dec 5, 2025

/ok to test 9a8e019

@yanring yanring force-pushed the zijiey/add_imbalance_logging branch from 9a8e019 to 71b3d1f Compare January 12, 2026 08:00
@yanring yanring changed the title [Dev] Improve MoE Logging [Dev] Refactor and Improve MoE Logging Jan 12, 2026
@yanring yanring self-assigned this Feb 14, 2026
@yanring yanring marked this pull request as ready for review February 14, 2026 17:11
@yanring yanring requested review from a team as code owners February 14, 2026 17:11

yanring commented Feb 14, 2026

/ok to test 1a18019

@yanring yanring added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Feb 24, 2026