Pipeline-parallel support for Knowledge Distillation (NeMo 2) #11766
Merged
ko3n1g merged 23 commits into NVIDIA-NeMo:main from AAnoosheh:aanoosheh/pp-distillation-nemo2 on Feb 6, 2025
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
e297474
yashaswikarnati approved these changes Feb 4, 2025
ko3n1g approved these changes Feb 4, 2025
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11766      +/-   ##
==========================================
+ Coverage   30.30%   30.43%   +0.13%
==========================================
  Files        1387     1399      +12
  Lines      176283   177330    +1047
  Branches    27091    27185      +94
==========================================
+ Hits        53423    53972     +549
- Misses     118776   119241     +465
- Partials     4084     4117      +33
BoxiangW pushed a commit that referenced this pull request Feb 7, 2025
* First draft of distill script port to 2.0
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Pipeline-parallel changes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Basic distillation running
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Add CLI args
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Most fixes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix callbacks in PP loop
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* More fixes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Rework checkpoint loading
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Resolve seemingly remaining bugs
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Refactor into multiple files
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Integration test
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Clean up strings
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Appease linter
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Remediate failing tests
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update CICD model definition
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Divert TB logger to same log_dir
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Load CICD model specially
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix SP flag
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Move test into own script
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update cicd dependency
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update cicd thing #2
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix new linting errors
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
---------
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025
…-NeMo#11766)
* First draft of distill script port to 2.0
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Pipeline-parallel changes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Basic distillation running
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Add CLI args
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Most fixes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix callbacks in PP loop
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* More fixes
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Rework checkpoint loading
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Resolve seemingly remaining bugs
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Refactor into multiple files
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Integration test
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Clean up strings
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Appease linter
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Remediate failing tests
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update CICD model definition
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Divert TB logger to same log_dir
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Load CICD model specially
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix SP flag
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Move test into own script
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update cicd dependency
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Update cicd thing NVIDIA-NeMo#2
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
* Fix new linting errors
  Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
---------
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
What does this PR do?
Enable pipeline parallelism in conjunction with student-teacher distillation in NeMo 2.
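As a rough illustration of what student-teacher distillation computes, here is a minimal pure-Python sketch. It is not the NeMo implementation: the PR does not state which loss is used, so the KL-divergence loss, the temperature parameter, and the helper names (`softmax`, `kd_loss`, `distillation_step`) are all assumptions made for illustration. The one idea taken from the PR is that the teacher runs on the same batch in addition to the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one token (assumed loss, not confirmed by the PR)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


def distillation_step(student_fn, teacher_fn, batch, temperature=1.0):
    """Run the teacher in addition to the student and average the per-token loss."""
    # The teacher is frozen: its outputs serve only as targets.
    teacher_out = [teacher_fn(token) for token in batch]
    student_out = [student_fn(token) for token in batch]
    losses = [kd_loss(s, t, temperature) for s, t in zip(student_out, teacher_out)]
    return sum(losses) / len(losses)


# Hypothetical toy "models": each maps a token id to three logits.
def toy_teacher(token):
    return [float(token), 2.0, 0.5]


def toy_student(token):
    return [float(token) * 0.5, 1.0, 0.5]


print(distillation_step(toy_student, toy_teacher, batch=[1, 2, 3]))
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, which is what makes it usable as a training signal for the student.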
Collection: [LLM]
Changelog
nemo/collections/llm/distillation
MegatronParallel forward pass to run teacher in addition to student
Usage
python scripts/llm/gpt_distillation.py \
    --name experiment_name \
    --teacher_path /path/to/nemo2/teacher \
    --student_path /path/to/nemo2/student \
    --tp_size 4 \
    --cp_size 1 \
    --pp_size 2 \
    --devices 8 \
    --log_dir /tmp/nemo2_llama_distill \
    --max_steps 100 \
    --gbs 64 \
    --mbs 4 \
    --data_paths weight1 /path/to/data/1 weight2 /path/to/data/2 \
    --index_mapping_dir /path/to/data/cache/ \
    --seq_length 8192 \
    --warmup_steps 5 \
    --val_check_interval 50 \
    --log_interval 5
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
Megatron-LM vs NeMo 2.0 Llama3.1 8b->4b distillation comparison (Validation LM-Loss)
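To make the parallel sizes in the usage command above concrete, the sketch below works out the implied data-parallel size and the number of microbatches per step. It assumes the usual Megatron-style convention (data-parallel size equals total devices divided by the product of tensor, pipeline, and context parallel sizes) and a single node, so that `--devices 8` is the total GPU count. The function `parallel_layout` is a hypothetical helper, not part of NeMo.

```python
def parallel_layout(devices, tp_size, pp_size, cp_size, gbs, mbs):
    """Derive data-parallel size and microbatch count from the CLI values.

    Assumes a single node and the Megatron-style convention
    dp = devices / (tp * pp * cp).
    """
    model_parallel = tp_size * pp_size * cp_size
    if devices % model_parallel != 0:
        raise ValueError("devices must be divisible by tp_size * pp_size * cp_size")
    dp_size = devices // model_parallel

    samples_per_pass = mbs * dp_size
    if gbs % samples_per_pass != 0:
        raise ValueError("gbs must be divisible by mbs * dp_size")
    # Each optimizer step accumulates this many microbatches per replica.
    num_microbatches = gbs // samples_per_pass

    return {"dp_size": dp_size, "num_microbatches": num_microbatches}


layout = parallel_layout(devices=8, tp_size=4, pp_size=2, cp_size=1, gbs=64, mbs=4)
print(layout)  # {'dp_size': 1, 'num_microbatches': 16}
```

Under these assumptions, the example command uses all 8 GPUs for one model replica (4-way tensor parallel times 2 pipeline stages), and each global batch of 64 is fed through the pipeline as 16 microbatches of 4.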
