Startup: add argument-consistency checks & summary table (Fixes #124) by MagellaX · Pull Request #409 · bigscience-workshop/Megatron-DeepSpeed

MagellaX · 2025-06-20T12:51:17Z

Summary

Adds a lightweight validation layer and a configuration summary printed at startup, inspired by GPT-NeoX, resolving Issue #124.

Key features

megatron/arguments.py
- _validate_and_summarize_args(args) — runs sanity checks:
  - hidden_size % num_attention_heads == 0
  - global_batch_size % data_parallel_size == 0
  - pad_vocab_size_to (if set) divisible by TP size
  - fp16 / bf16 mutual-exclusion echoed
- Builds a rank-0 console table summarising world-size layout, model dims, batch sizes, precision, and passes.
- Raises ValueError if any rule fails, aborting early before costly init.
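The rules listed above could be sketched roughly as follows. This is an illustrative stand-in, not the code from this PR: the helper names `collect_violations` and `validate_or_raise` are hypothetical, and the attribute name `tensor_model_parallel_size` is an assumption for "TP size".

```python
from argparse import Namespace
from typing import List


def collect_violations(args: Namespace) -> List[str]:
    """Return human-readable rule violations; an empty list means all checks pass."""
    errors = []
    if args.hidden_size % args.num_attention_heads != 0:
        errors.append(
            f"hidden_size ({args.hidden_size}) must be divisible by "
            f"num_attention_heads ({args.num_attention_heads})"
        )
    if args.global_batch_size % args.data_parallel_size != 0:
        errors.append(
            f"global_batch_size ({args.global_batch_size}) must be divisible by "
            f"data_parallel_size ({args.data_parallel_size})"
        )
    # pad_vocab_size_to is optional; only check it when it is set.
    pad = getattr(args, "pad_vocab_size_to", None)
    if pad is not None and pad % args.tensor_model_parallel_size != 0:
        errors.append(
            f"pad_vocab_size_to ({pad}) must be divisible by the TP size "
            f"({args.tensor_model_parallel_size})"
        )
    # fp16 and bf16 are mutually exclusive precision modes.
    if getattr(args, "fp16", False) and getattr(args, "bf16", False):
        errors.append("fp16 and bf16 cannot both be enabled")
    return errors


def validate_or_raise(args: Namespace) -> None:
    """Abort early with a single ValueError that lists every failed rule."""
    errors = collect_violations(args)
    if errors:
        raise ValueError("Invalid configuration:\n  " + "\n  ".join(errors))
```

Collecting all violations before raising is a design choice in this sketch, so that one failed launch reports every problem at once; the PR itself may raise on the first failure.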

Why it matters

Misconfigurations (e.g., mismatched hidden/head sizes or bad batch divisibility) now surface at startup, before costly initialisation, rather than after hours of debugging and wasted GPU time.

Testing

- pytest -q tests: all existing tests pass.
- Launched pretrain_gpt_tiny.sh on 1-GPU and 4-GPU runs; the summary appears once, on rank 0.
- Introduced an invalid hidden_size (not divisible by heads): the run aborts immediately with a clear error.
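The "printed once on rank 0" behaviour checked above can be illustrated with a minimal sketch. The function names `format_summary` and `print_summary_rank_0` are hypothetical, and the rank is passed in explicitly here; in a real distributed run it would presumably come from something like `torch.distributed.get_rank()`.

```python
def format_summary(rows):
    """Render (name, value) pairs as a fixed-width ASCII table."""
    key_width = max(len(name) for name, _ in rows)
    val_width = max(len(str(value)) for _, value in rows)
    border = "+" + "-" * (key_width + 2) + "+" + "-" * (val_width + 2) + "+"
    lines = [border]
    for name, value in rows:
        lines.append(f"| {name:<{key_width}} | {str(value):<{val_width}} |")
    lines.append(border)
    return "\n".join(lines)


def print_summary_rank_0(rows, rank):
    """Print the table only on rank 0; return True if it was printed."""
    if rank != 0:
        return False
    print(format_summary(rows))
    return True
```

For example, calling `print_summary_rank_0([("hidden_size", 1024), ("precision", "fp16")], rank)` on every process prints a single table, because only the process with `rank == 0` passes the guard.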

Backward compatibility

Purely additive logging/validation. No impact on training logic or performance.

Fixes #124

MagellaX added 2 commits June 20, 2025 17:32

fix(training): correct rank-zero log messages

61ed02d

startup: add argument consistency checks and summary table (Fixes big…

c2c829a

…science-workshop#124)

MagellaX wants to merge 2 commits into bigscience-workshop:main from MagellaX:feat/arg-validation-summary