Pipeline-parallel support for Knowledge Distillation (NeMo 2) #11766

Merged
ko3n1g merged 23 commits into NVIDIA-NeMo:main from AAnoosheh:aanoosheh/pp-distillation-nemo2
Feb 6, 2025

Conversation

@AAnoosheh (Collaborator) commented Jan 6, 2025

What does this PR do?

Enables pipeline parallelism in conjunction with student-teacher knowledge distillation (KD) in NeMo 2.

Collection: [LLM]

Changelog

  • Add a new script, scripts/llm/gpt_distillation.py, to run KD in NeMo 2
  • Add the folder nemo/collections/llm/distillation
  • Modify the MegatronParallel forward pass to run the teacher in addition to the student
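
For readers unfamiliar with student-teacher distillation, the following is a minimal sketch of one common logit-distillation objective: the temperature-softened KL divergence between the teacher's and the student's output distributions. This PR description does not state which loss the new code uses, so treat this purely as an illustration. The function names `log_softmax` and `kd_kl_loss` are hypothetical and are not part of NeMo.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def kd_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper for illustration only; not the NeMo implementation.
    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2


# Identical logits give zero loss; diverging logits give a positive loss.
same = kd_kl_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kd_kl_loss([0.0, 0.0], [1.0, 0.0], temperature=2.0)
```

In a distillation step the teacher is run in inference mode and only the student's parameters receive gradients from such a loss.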

Usage

        python scripts/llm/gpt_distillation.py \
          --name experiment_name \
          --teacher_path /path/to/nemo2/teacher \
          --student_path /path/to/nemo2/student \
          --tp_size 4 \
          --cp_size 1 \
          --pp_size 2 \
          --devices 8 \
          --log_dir /tmp/nemo2_llama_distill \
          --max_steps 100 \
          --gbs 64 \
          --mbs 4 \
          --data_paths weight1 /path/to/data/1 weight2 /path/to/data/2 \
          --index_mapping_dir /path/to/data/cache/ \
          --seq_length 8192 \
          --warmup_steps 5 \
          --val_check_interval 50 \
          --log_interval 5
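
As a sanity check on the flags above, the sketch below works out the data-parallel size and the number of micro-batches per step under the usual Megatron-style convention (devices = TP × PP × CP × DP, global batch = micro-batch × DP × micro-batches). The helper `parallel_layout` is hypothetical and only mirrors the flag names from the command; it assumes a single node whose `--devices` count is the total number of GPUs.

```python
def parallel_layout(devices, tp_size, pp_size, cp_size, gbs, mbs, seq_length):
    """Derive data-parallel size and micro-batch count from the CLI values.

    Hypothetical helper for illustration; it is not part of the script.
    """
    model_parallel = tp_size * pp_size * cp_size
    if devices % model_parallel != 0:
        raise ValueError("devices must be divisible by tp_size * pp_size * cp_size")
    dp_size = devices // model_parallel

    if gbs % (mbs * dp_size) != 0:
        raise ValueError("gbs must be divisible by mbs * dp_size")
    num_microbatches = gbs // (mbs * dp_size)

    return {
        "dp_size": dp_size,
        "num_microbatches": num_microbatches,
        "tokens_per_step": gbs * seq_length,
    }


layout = parallel_layout(
    devices=8, tp_size=4, pp_size=2, cp_size=1, gbs=64, mbs=4, seq_length=8192
)
print(layout)
# {'dp_size': 1, 'num_microbatches': 16, 'tokens_per_step': 524288}
```

With these values there is a single data-parallel replica, and each optimizer step streams 16 micro-batches through the two pipeline stages.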

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Megatron-LM vs. NeMo 2.0 Llama 3.1 8B -> 4B distillation comparison (validation LM loss):
[Image: Screen Shot 2025-02-04 at 3 01 27 PM]

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
@AAnoosheh AAnoosheh enabled auto-merge (squash) February 4, 2025 20:32
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from github-actions bot Feb 4, 2025
@AAnoosheh AAnoosheh dismissed stale reviews from ko3n1g and yashaswikarnati via e297474 February 4, 2025 22:11
@ko3n1g ko3n1g disabled auto-merge February 4, 2025 22:20
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Feb 4, 2025
@ko3n1g ko3n1g enabled auto-merge (squash) February 4, 2025 22:24
@AAnoosheh AAnoosheh removed the request for review from dimapihtar February 5, 2025 15:20
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 30.43%. Comparing base (09186c3) to head (b8dda2c).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11766      +/-   ##
==========================================
+ Coverage   30.30%   30.43%   +0.13%     
==========================================
  Files        1387     1399      +12     
  Lines      176283   177330    +1047     
  Branches    27091    27185      +94     
==========================================
+ Hits        53423    53972     +549     
- Misses     118776   119241     +465     
- Partials     4084     4117      +33     

@ko3n1g ko3n1g merged commit e51ec38 into NVIDIA-NeMo:main Feb 6, 2025
219 of 221 checks passed
BoxiangW pushed a commit that referenced this pull request Feb 7, 2025
* First draft of distill script port to 2.0

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Pipeline-parallel changes

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Basic distillation running

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Add CLI args

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Most fixes

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Fix callbacks in PP loop

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* More fixes

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Rework checkpoint loading

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Resolve seemingly remaining bugs

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Refactor into multiple files

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Integration test

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Clean up strings

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Appease linter

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Remediate failing tests

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Update CICD model definition

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Divert TB logger to same log_dir

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Load CICD model specially

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Fix SP flag

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Move test into own script

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Update cicd dependency

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Update cicd thing #2

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

* Fix new linting errors

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>

---------

Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@AAnoosheh AAnoosheh deleted the aanoosheh/pp-distillation-nemo2 branch March 21, 2025 22:40