Akoumparouli/distirbuted#1

Closed
akoumpa wants to merge 12 commits into main from akoumparouli/distirbuted

Conversation

@akoumpa
Contributor

@akoumpa akoumpa commented May 22, 2025

No description provided.

hemildesai and others added 12 commits May 6, 2025 13:10
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
* move examples to recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move everything under automodel

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* baby steps

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* import from NeMo 1f511fd & bfbd333

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add get method with fallback

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add ranked param

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update resolve target

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* renmame

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add rng.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add __contains__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* special handle for _fn keys

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move utils to file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move hellaswag to SFTSingleTurnPreprocessor

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix for _fn

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add base recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
* cleaup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleaup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move DistInfo

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move num_epochs to StepScheduler

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/distirbuted branch from bec7ca9 to 36c0c68 Compare May 22, 2025 07:42
akoumpa added a commit that referenced this pull request May 27, 2025
* move examples to recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move everything under automodel

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* baby steps

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* import from NeMo 1f511fd & bfbd333

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add get method with fallback

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add ranked param

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update resolve target

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* renmame

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add rng.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add __contains__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* special handle for _fn keys

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move utils to file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move hellaswag to SFTSingleTurnPreprocessor

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix for _fn

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add base recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
akoumpa added a commit that referenced this pull request May 27, 2025
* move examples to recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move everything under automodel

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* baby steps

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* import from NeMo 1f511fd & bfbd333

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add get method with fallback

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add ranked param

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update resolve target

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* renmame

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add rng.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add __contains__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* special handle for _fn keys

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move utils to file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move hellaswag to SFTSingleTurnPreprocessor

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix for _fn

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add base recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
akoumpa added a commit that referenced this pull request May 28, 2025
* Initial commit

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add Llama 3.2 1b hellaswag finetuning example

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* [automodel] misc fixes (#1)

* move examples to recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move everything under automodel

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* baby steps

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* import from NeMo 1f511fd & bfbd333

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add get method with fallback

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add ranked param

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update resolve target

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* renmame

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add rng.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add __contains__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* special handle for _fn keys

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move utils to file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move hellaswag to SFTSingleTurnPreprocessor

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix for _fn

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add base recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleaup (#3)

* cleaup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleaup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move DistInfo

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move num_epochs to StepScheduler

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change recipe name to FinetuneRecipeForNextTokenPrediction

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* restructure (#2)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* step

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* step

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move file

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* step

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* state

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update docsring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* change args

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ddp fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ddp fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* lint fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove checkpoint.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove model_utils

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* moved init_utils to automodel.distributed

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove config.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove checkpoint_utils.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove custom automodel for now

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add missing import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused wandb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move datasets to llm

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove config_utils

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pass num_grad_acc_steps

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix defaults

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add assert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move StepScheduler to base_recipe.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move StepScheduler to base_recipe.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* cleanup

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move base_recipe into training

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fx

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix import path

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix default args

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add assert

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* inherit from Stateful

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add epochs propery

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstring

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
akoumpa added a commit that referenced this pull request Apr 21, 2026
#1936)

fix: Step-3.5-Flash layer_types mismatch and related recipe fixes (#1916)

* fix: add tiktoken dep, patch Step-3.5-Flash layer_types mismatch, tune Qwen MoE recipes

- Add tiktoken to base deps for Moonlight's TikToken-based remote tokenizer.
- Retry AutoConfig.from_pretrained when upstream configs ship layer_types
  longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash) by
  truncating layer_types in the raw config dict and rebuilding via
  the resolved config class (dynamic module or CONFIG_MAPPING).
- Bump qwen3_moe_30b_hellaswag hf_kl_threshold 1e-3 -> 1e-2 and
  qwen3_moe_30b_uccl_ep ep_size 16 -> 8.
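
The truncation step described in the second bullet can be sketched roughly as follows. This is only an illustration of the idea, not the code from the referenced commit: the helper name `truncate_layer_types` is hypothetical, and the sketch operates on a plain dict so it does not depend on any particular library version.

```python
from typing import Any


def truncate_layer_types(raw_config: dict) -> dict:
    """Return a shallow copy of a raw config dict in which `layer_types`
    has at most `num_hidden_layers` entries.

    Hypothetical helper for illustration; the key names come from the
    commit message, everything else is an assumption.
    """
    fixed: dict = dict(raw_config)  # leave the caller's dict untouched
    layer_types: Any = fixed.get("layer_types")
    num_layers: Any = fixed.get("num_hidden_layers")
    if isinstance(layer_types, list) and isinstance(num_layers, int):
        if len(layer_types) > num_layers:
            # Keep only the entries that correspond to real layers.
            fixed["layer_types"] = layer_types[:num_layers]
    return fixed


# Example with a made-up config: three layer types, two hidden layers.
raw = {
    "num_hidden_layers": 2,
    "layer_types": ["full_attention", "sliding_attention", "full_attention"],
}
patched = truncate_layer_types(raw)
print(patched["layer_types"])
```

In a retry path like the one the commit describes, a dict patched this way would then be handed to the resolved config class to rebuild the config object; that step is omitted here because it depends on how the class is resolved.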




* Update uv lock



* Apply suggestion from @claude[bot]



---------

Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

3 participants