
fix: tied embedding v4 to v5 #1631

Merged: akoumpa merged 8 commits into main from akoumparouli/fix_tied_embedding_v4_to_v5 on Mar 31, 2026

Conversation

@akoumpa (Contributor) commented Mar 31, 2026

What does this PR do?

class NemotronFlashForCausalLM(NemotronFlashPreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]  # <- v4-style: a plain list of tied keys

    def __init__(self, config: NemotronFlashConfig):
        super().__init__(config)
        self.config = config
        self.model = NemotronFlashModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.post_init()

In transformers v4, _tied_weights_keys was used to denote which layers are tied to the nn.Embedding.

In v5 the concept was generalized: instead of every listed key being implicitly tied to the embedding, _tied_weights_keys is now a dictionary mapping between source and destination FQNs.

As a result, restoring v4 checkpoints in v5 fails, because v5 expects _tied_weights_keys to be a dictionary.

This PR converts such cases on the fly so that v4 checkpoints load correctly in v5 (previously they crashed).
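
As a rough sketch of what the on-the-fly upgrade can look like (illustrative only, not the code added in this PR; the helper name, the dict orientation, and the embedding FQN are assumptions):

def upgrade_tied_weights_keys(model):
    tied = getattr(model, "_tied_weights_keys", None)
    if tied is None or isinstance(tied, dict):
        return tied  # already v5-style, or nothing to tie

    # v4-style: a plain list of FQNs, all implicitly tied to the input embedding.
    source = "model.embed_tokens.weight"  # assumed embedding FQN for this architecture
    upgraded = {dest: source for dest in tied}
    # e.g. ["lm_head.weight"] -> {"lm_head.weight": "model.embed_tokens.weight"}
    model._tied_weights_keys = upgraded
    return upgraded

The following fine-tuning configuration was used to verify the fix: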

distributed:
  strategy: fsdp2
  dp_size: 8
  tp_size: 1
  cp_size: 1
  pp_size: 1
  sequence_parallel: false
  activation_checkpointing: true
checkpoint:
  checkpoint_dir: checkpoints/
  enabled: false
  model_save_format: torch_save
  save_consolidated: false
dataloader:
  _target_: torchdata.stateful_dataloader.stateful_dataloader.StatefulDataLoader
  collate_fn: nemo_automodel.components.datasets.utils.default_collater
  shuffle: false
dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train
dist_env:
  backend: nccl
  timeout_minutes: 1
loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: /akoumparouli/automodel/ckpt/nvidia/Nemotron-Flash-1B/
  torch_dtype: bf16
  trust_remote_code: true
  dtype: bfloat16
optimizer:
  _target_: torch.optim.adamw.AdamW
  lr: 0.0002
  weight_decay: 0.1
packed_sequence:
  packed_sequence_size: 512
  split_across_pack: false
step_scheduler:
  ckpt_every_steps: 2000
  global_batch_size: 128
  local_batch_size: 4
  max_steps: 20
  num_epochs: 1

Note: this model has another issue. When the model is initialized on the meta device, A_log is created as float32 while the remaining parameters are bfloat16, even though the public checkpoint uses bfloat16 everywhere. FSDP2 cannot wrap a module with mixed dtypes, and the underlying problem is that the checkpoint's dtypes are not preserved when the model is materialized.

model.layers.17.ffn.gate_proj.weight torch.bfloat16
model.layers.17.ffn.down_proj.weight torch.bfloat16
model.layers.17.ffn.up_proj.weight torch.bfloat16
model.layers.17.pre_ffn_layernorm.weight torch.bfloat16
model.layers.18.mamba.dt_bias torch.bfloat16
model.layers.18.mamba.A_log torch.float32
model.layers.18.mamba.D torch.bfloat16
model.layers.18.mamba.in_proj.weight torch.bfloat16
model.layers.18.mamba.conv1d.weight torch.bfloat16
model.layers.18.mamba.conv1d.bias torch.bfloat16
model.layers.18.mamba.norm.weight torch.bfloat16
model.layers.18.mamba.out_proj.weight torch.bfloat16
model.layers.18.input_layernorm.weight torch.bfloat16
model.layers.19.ffn.gate_proj.weight torch.bfloat16
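
A minimal sketch of the dtype-preservation idea (illustrative only; the _get_checkpoint_tensor_dtypes helper added in this PR may differ, and a single-file safetensors checkpoint is assumed):

import torch
from safetensors import safe_open

def get_checkpoint_tensor_dtypes(path: str) -> dict[str, torch.dtype]:
    # Record the dtype of every tensor stored in the checkpoint.
    dtypes = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            dtypes[name] = f.get_tensor(name).dtype  # loads the tensor; fine for a sketch
    return dtypes

def cast_params_to_checkpoint_dtypes(model: torch.nn.Module, dtypes: dict[str, torch.dtype]) -> None:
    # Cast each parameter to the dtype stored in the checkpoint, so e.g. A_log
    # ends up bfloat16 like the rest and FSDP2 never sees mixed dtypes.
    for name, param in model.named_parameters():
        want = dtypes.get(name)
        if want is not None and param.dtype != want:
            param.data = param.data.to(want)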

Changelog

  • Add specific line by line info of high level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

akoumpa added 2 commits March 30, 2026 19:45
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
copy-pr-bot (bot) commented Mar 31, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa (Contributor, Author) commented Mar 31, 2026

/ok to test a38292d

@akoumpa (Contributor, Author) commented Mar 31, 2026

/claude review

claude[bot] previously approved these changes Mar 31, 2026

LGTM

@akoumpa (Contributor, Author) commented Mar 31, 2026

/ok to test 1fe2c82

@akoumpa (Contributor, Author) commented Mar 31, 2026

/claude review

Comment thread nemo_automodel/components/checkpoint/utils.py
Comment thread tests/unit_tests/_transformers/test_auto_model.py
akoumpa merged commit 54cbe82 into main on Mar 31, 2026
52 of 53 checks passed
akoumpa deleted the akoumparouli/fix_tied_embedding_v4_to_v5 branch on March 31, 2026 at 19:04
HuiyingLi added a commit that referenced this pull request Apr 2, 2026
* docs: update coverage doc (#1609)

* fix: resolve TP+PP pipeline parallelism bugs for custom HF models

When pipeline parallelism splits a model, nn.ModuleList layers are
converted to nn.ModuleDict. Three issues surfaced with custom models
(e.g. DeciLM/Nemotron-49B) that use explicit self.num_heads in
attention views and return tuples from decoder layers:

1. _update_attention_head_counts_for_tp iterates `for layer in layers`,
   which yields string keys (not modules) for ModuleDict — head counts
   were never updated, causing shape mismatches in the Q/K/V view.

2. The walrus operator fallback for causal_mask_mapping could leave a
   raw 2D attention_mask in place of the expected 4D causal mask when
   the import or computation failed silently.

3. The batch device-move code filtered out None values from nested
   dicts, dropping causal_mask_mapping entries for sdpa-configured
   models where create_causal_mask returns None.

Additionally, decoder layers in older-style HF models (pre-v5) return
tuples rather than bare tensors, and raw 2D padding masks that leak
through the pipeline schedule need to be dropped before reaching
custom attention code.

Verified on nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 with tp4pp2
(100 training steps, hellaswag dataset, 8xH100).
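
A minimal illustration of the ModuleDict pitfall from item 1 above (the containers and names are made up, not the automodel code):

import torch.nn as nn

layers_list = nn.ModuleList([nn.Linear(8, 8) for _ in range(2)])
layers_dict = nn.ModuleDict({str(i): nn.Linear(8, 8) for i in range(2)})

for layer in layers_list:
    print(type(layer))   # <class 'torch.nn.modules.linear.Linear'>

for layer in layers_dict:
    print(type(layer))   # <class 'str'> -- setting attributes on it updates nothing

# Handling both container types explicitly avoids the silent no-op:
for layer in (layers_dict.values() if isinstance(layers_dict, nn.ModuleDict) else layers_dict):
    pass  # update per-layer head counts here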

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* update recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* add recent recipes

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* some fix

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* somefix2

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add reranker training (#1449)

* moving from biencoder to encoder refactor

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* staging

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* train_encoder.py -> train_retriever_encoder.py

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* cross encoder recipe

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* cleaning up

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* dir refactor

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* separating out biencoder and crossencoder

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* lm_q -> model

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* updating configs and adding dataset tests for cross encoder

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* lint

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* bug fixes

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* changes

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* adding acc logging

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* te patches + lint

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* refactor: unify passage count config — rename train_n_passages to n_passages, derive from dataloader

Remove redundant top-level train_n_passages and eval_negative_size from YAML configs and recipe __init__. The recipe now derives train_n_passages and val_n_passages directly from the dataloader dataset config (n_passages), making the dataset config the single source of truth. Also removes dead train_n_passages param from CrossEncoderCollator and unused temperature from cross-encoder config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: rename encoder -> retrieval naming convention (#1536)

* refactor: rename encoder -> retrieval for directory umbrella

Rename recipes, examples, and _transformers module to use 'retrieval'
as the umbrella term for bi-encoder, cross-encoder, late-interaction,
and sparse encoder models. This aligns with IR community conventions
and the existing data-layer naming (retrieval_dataset, retrieval_collator).

Key changes:
- recipes/encoder/ -> recipes/retrieval/
- examples/encoder/ -> examples/retrieval/
- _transformers/encoder.py -> _transformers/retrieval.py
- encoder_collator.py -> retrieval_collator.py
- train_retriever_encoder.py -> train_bi_encoder.py
- All test imports updated to new paths

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* refactor: rename classes to match retrieval directory convention

- _NeMoAutoModelEncoderBase -> _NeMoAutoModelForRetrievalBase
  (follows HF ForX pattern: ForCausalLM, ForSequenceClassification)
- TrainRetrieverEncoderRecipe -> TrainBiEncoderRecipe
  (matches train_bi_encoder.py filename)

EncoderModel, BiEncoderModel, CrossEncoderModel class names are
kept — they are architecturally precise per Jimmy Lin's framework.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* refactor: rename RetrievalEncoderCollator -> BiEncoderCollator

Symmetric with CrossEncoderCollator — both collators named by the
model type they serve. Eliminates the mixed naming
(Retrieval + Encoder) that was inconsistent with the convention:
directories use task names (retrieval), classes use architecture
names (BiEncoder, CrossEncoder).

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* refactor: rename encoder -> biencoder in test files, dirs, and checkpoint paths

Aligns test directory names, file names, YAML checkpoint paths, and
comments/docstrings with the source-code rename from encoder to biencoder.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* refactor: standardize naming — bi_encoder/cross_encoder in snake_case, bi-encoder/cross-encoder in prose

Applies consistent naming convention across the codebase:
- snake_case (files, dirs, code): bi_encoder, cross_encoder
- PascalCase (classes): BiEncoder, CrossEncoder
- Prose (docs, comments): bi-encoder, cross-encoder

Updates model_type enum values, function names, YAML configs,
test files/dirs, checkpoint paths, and docstrings.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* refactor: fix stale class references and NeMoAutoModelBiencoder casing

- Fix NeMoAutoModelBiencoder → NeMoAutoModelBiEncoder in test docstring
- Remove stale references to deleted classes (RetrievalMultiModalDatasetLoader,
  CrossEncoderMultiModalDatasetLoader) from docstrings
- Update docs/guides/llm/retrieval-dataset.md prose: encoder → bi-encoder

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>

* fix: update stale config path and remove dropped cls_last pool type from test

- Fix default_config_path in train_bi_encoder.py main() from
  examples/encoder/ to examples/retrieval/ (missed in rename)
- Remove cls_last from test_pool_basic_modes parametrize since pool()
  no longer supports it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Co-authored-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR #1449 review comments + remove EncoderModel base class

Bug fixes:
- Allow seed=0 in rng.py (was asserting seed > 0)
- Default eval_negative_size to n_passages-1 instead of hardcoded 10
- Make cross-encoder prompt template configurable via YAML
- Add temperature/pooling to cross-encoder YAML, fix pooling propagation
- Fix auto_map config class name for cross-encoder models

Refactor:
- Extract shared functions (build_encoder_backbone, save_encoder_pretrained,
  configure_encoder_metadata) from EncoderModel
- Refactor BiEncoderModel and CrossEncoderModel to standalone nn.Module
- Remove EncoderModel abstract base class
- Code simplification: DRY extractions, slop removal, comprehensions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert simplification changes to retrieval_collator.py, keep prompt template

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert simplification changes to dataset files, keep eval_negative_size fix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: revert simplification changes to rng.py, keep seed>=0 fix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use AutoModelForSequenceClassification key in auto_map for cross-encoders

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update argparse eval_negative_size default from 10 to None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix small bugs

* feat: support any HF model as encoder backbone via AutoModel fallback

build_encoder_backbone now falls back to AutoModel.from_pretrained (or
AutoModelForSequenceClassification for score task) when the model type
is not in SUPPORTED_BACKBONES, enabling use of any HF model like
Qwen/Qwen3-1.7B as an encoder backbone.

- Add retrieval tags to MODEL_ARCH_MAPPING so downstream code can
  distinguish custom retrieval models from generic HF models
- configure_encoder_metadata skips auto_map for generic HF models
- _init_encoder_common uses config.name_or_path for generic HF models
- Move HF Auto class registration into llama_bidirectional/model.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: derive auto_map module names dynamically instead of hardcoding "model."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use 'llama_bidirec' as model_type instead of class name

Saved checkpoints were getting model_type "LlamaBidirectionalModel"
which breaks the convention of short snake_case identifiers and doesn't
match SUPPORTED_BACKBONES. Now uses "llama_bidirec" which is already
handled by both SUPPORTED_BACKBONES and HF Auto class registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci: add cross-encoder (reranker) functional test

Add an end-to-end CI test for the cross-encoder training recipe:
- 2-GPU FSDP2 training (32 steps) with quality evaluation
- Check 1: finetuned pos-score > baseline pos-score
- Check 2: ranking accuracy >= 75%
- Add main() entrypoint to train_cross_encoder.py for module invocation
- Register L2_Retrieval job in cicd-main.yml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* test: add cross-encoder unit tests for accuracy, batch_mrr, collator, and flatten

- Add 8 tests for accuracy() and batch_mrr() pure functions
- Add 3 collator tests: output keys/shapes, labels, pad_to_multiple_of
- Add 3 tests for flatten_bi_encoder_to_cross_encoder value validation
- Add 2 cross-encoder liger/SDPA retry tests; refactor retry helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* lint

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

* fix: use bare coverage run in cross-encoder test for proper data collection

Drop custom COVERAGE_ARGS (--data-file, --source, --parallel-mode) that
prevented coverage data from being collected. Matches the pattern used
by all other multi-GPU functional tests (L2_DCP, L2_HF_PEFT, etc.).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>

---------

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Co-authored-by: Ronay Ak <ronaya@nvidia.com>

* fix: fix NemotronHForCausalLM force_hf=True (#1625)

* align trust_remote and local transformers for NemotronHForCausalLM

Signed-off-by: Yuki Huang <yukih@nvidia.com>

* fix load model

Signed-off-by: Yuki Huang <yukih@nvidia.com>

---------

Signed-off-by: Yuki Huang <yukih@nvidia.com>

* fix: fix gradient_checkpointing overhead in transformers 5.3 (#1621)

fix gradient_checkpointing overhead in transformers 5.3

Signed-off-by: Yuki Huang <yukih@nvidia.com>

* feat: Migrate diffusion recipe to use Stateful Dataloader (#1630)

* feat: Migrate diffusion recipe to use Stateful Dataloader

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Fix linting errors

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

---------

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* feat: add Nemotron Nano 4B SQuAD finetune recipe (#1624)

* feat: add Nemotron Nano 4B SQuAD finetune recipe

Signed-off-by: David <user_davidoneil@dgx-B200-2.cm.cluster>

* Update examples/llm_finetune/nemotron/nemotron_nano_4b_squad.yaml

Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

---------

Signed-off-by: David <user_davidoneil@dgx-B200-2.cm.cluster>
Co-authored-by: David <user_davidoneil@dgx-B200-2.cm.cluster>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

* feat: Ensure that diffusion training jobs use the safetensors checkpoint format (#1627)

* feat: Ensure that diffusion training jobs use the safetensors checkpoint format

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Fix lint errors

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Add tests for diffusers compatible checkpointing

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* Fix linting issue

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

---------

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

* ci: Pass argument automodel dir for transformer version check (#1617)

Pass argument automodel dir

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* fix: from_pretrained with nested kwargs (e.g. text_config) crashes on VLM models (#1623)

* feat: add hybridep (#1333)

* feat: add hybridep

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Update uv lock

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* fixes

Signed-off-by: hemildesai <hemild@nvidia.com>

* fixes

Signed-off-by: hemildesai <hemild@nvidia.com>

* fixes

Signed-off-by: hemildesai <hemild@nvidia.com>

* fixes

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fixes

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fixes

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Address review comments: clean up Docker build artifacts and add unit tests

- Clean up DeepEP/, rdma-core/, and deepep.patch after Docker build to reduce image bloat
- Add unit tests for _HybridEPManager._indices_to_multihot covering basic,
  topk=1, all-invalid, partial-invalid, and single-token edge cases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Remove stale INSTALL_DEEPEP references from docs and CI

DeepEP is now always installed (no longer behind a build arg), so remove
the leftover INSTALL_DEEPEP references from CONTRIBUTING.md and the
build-container GitHub Action.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Fix typo in get_dispatched_metadata and document unused HybridEP params

- Rename get_dispached_metadata -> get_dispatched_metadata across all
  three dispatch manager classes
- Document that async_finish and allocate_on_comm_stream are not
  supported by the HybridEP backend (kept in signature for interface
  compatibility with callers)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Update secrets baseline after HybridEP changes

Regenerated .secrets.baseline to keep line numbers in sync with the
updated files. All entries are false positives (config keys, test
fixtures, high-entropy example strings).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: tied embedding v4 to v5 (#1631)

* fix test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add filter_forward_kwargs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use filter_forward_kwargs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add _get_checkpoint_tensor_dtypes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* preserve dtype

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update tests

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update resolve

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ci: Add deleted files explicitly in coverage omit (#1637)

Add deleted files explicitly in coverage omit

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

* feat: add AGENTS.md (#1638)

* add AGENTS.md

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add skills

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update readme

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix: remove redundant _keep_in_fp32_modules for layer norms in GptOssForCausalLM (#1633)

* fix: remove redundant _keep_in_fp32_modules for layer norms in GptOssForCausalLM

Signed-off-by: stanley1208 <stanley.mei08@gmail.com>
Made-with: Cursor

* lint

---------

Signed-off-by: stanley1208 <stanley.mei08@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

* cp: feat: VLM pretokenized data pipeline with neat packing (#1618)

* feat: add neat packing (greedy knapsack) for LLM and VLM datasets

Implement sequence packing via min-heap first-fit-decreasing knapsack
for both LLM and VLM datasets, with indexed attention masks and flash
attention support. Includes unit tests and benchmarks.
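
One generic way such a min-heap greedy knapsack packer can be written (a sketch under assumed interfaces; the actual automodel packer and its pack_size handling may differ):

import heapq

def pack_sequences(lengths, pack_size):
    # Group sequence indices into packs of at most pack_size tokens.
    # Longest-first so large sequences claim space before small ones fill the gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    heap = []  # entries are (used_tokens, indices_in_pack)
    for idx in order:
        length = lengths[idx]
        if heap and heap[0][0] + length <= pack_size:
            used, members = heapq.heappop(heap)      # emptiest pack that still fits
            members.append(idx)
            heapq.heappush(heap, (used + length, members))
        else:
            heapq.heappush(heap, (length, [idx]))    # open a new pack
    return [members for _, members in heap]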

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add LengthGroupedSampler for token-aware distributed sampling

Sort samples by estimated token length (text + media) and shuffle
within buckets to keep batch-internal lengths similar, reducing padding
waste. Includes accurate image/video token count estimation via
smart_resize and comprehensive test suite.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: integrate neat packing strategy into LLM finetune recipe

Add packing_strategy config field ("neat" or "thd") to select between
greedy knapsack packing and existing THD packing in the LLM recipe.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* chore: remove benchmark scripts not needed for this PR

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: lint errors and broken sampler tests

Remove unused import and variable in neat_packing_vlm.py.
Fix 13 sampler tests that referenced non-existent bucket_size
and shuffle_bucket_size parameters.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: fix all ruff lint errors across changed files

Sort imports, remove unused imports/variables, fix f-strings
without placeholders, rename ambiguous variable name.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: run ruff format on all changed source and test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* style: add missing copyright headers to test files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add meta-dataset loading system with ShareGPT format support

Implement LLaMA-Factory style meta JSON dataset loading with support
for multiple dataset composition, sampling ratios, ShareGPT format
conversion, LMDB image storage, video frame reading via decord, media
preloading, and cross-rank data sharding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add RobustDatasetWrapper with retry and fake image injection

RobustDatasetWrapper provides data loading error retry, media
preloading, and fake image injection to prevent FSDP/Zero3 hangs on
pure-text batches. PreTokenizedDatasetWrapper supports per-sample
tokenization in DataLoader workers with overlong sample detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* enhance: refactor label building with template-based approach

Replace BPE context-sensitive pattern matching with token ID-level
scanning (build_labels_from_template) for reliable assistant turn
detection. Remove qwen2_5 dependency on qwen_vl_utils. Add per-sample
media counts (n_images_per_sample/n_videos_per_sample) to collate
output for precise PP chunking. Replace truncation with pre-filtering
via _drop_overlong_samples. Use decord as video backend globally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: simplify video timestamp handling with VideoMetadata

Replace the manual _fix_video_timestamps regex approach with
_build_video_metadata that passes metadata directly to the processor.
Also adds second_per_grid_ts to output keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add precompute_tokens script for offline tokenization

Offline parallel tokenization tool that writes _text_tokens counts
to dataset samples, enabling LengthGroupedSampler to use exact token
counts instead of heuristic estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: wire up configure_packing and attn-aware collaters for neat packing

Wire up configure_packing and attn-aware collaters into both LLM and VLM
recipes so neat packing correctly enforces per-document attention
boundaries with flash_attention_2 and SDPA.

Changes:
- neat_packed_collater: accept attn_implementation param, keep 2D indexed
  mask for flash, 4D bool block-causal mask for SDPA
- configure_packing: patch create_causal_mask in qwen2/qwen2_5_vl/qwen2_vl/
  qwen3_vl/qwen3_vl_moe modules via importlib loop
- LLM recipe: call configure_packing when packing_strategy=neat, detect
  attn backend from cfg_model (backend.attn or attn_implementation)
- VLM recipe: add pretokenize + packing path to build_dataloader with
  cfg_model param, same attn detection logic
- Add 3 example recipes: LLM neat packing, VLM 4B neat packing,
  VLM MoE 30B neat packing

Tested:
- VLM Qwen3-VL-4B flash: 4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-4B sdpa:  4.19 -> 1.47 -> 0.49
- VLM Qwen3-VL-30B MoE flash: 1.76 -> 0.41 -> 0.10
- LLM Qwen2.5-0.5B flash+force_hf: 3.72 -> ... -> 2.84

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move VLM packing config to top-level packed_sequence section

Move packing configuration from nested dataset.packing to a top-level
packed_sequence: section, matching the LLM recipe pattern. This decouples
dataset definition from packing strategy.

The VLM recipe's build_dataloader now accepts cfg_ps and reads packing
config from there first, falling back to legacy dataset.packing for
backward compatibility.

Additional fixes from merge:
- Fix stale build_labels() call in collate_fns.py (merge artifact)
- Revert phi4/kimi collate to use build_labels (not in _IMSTART allowlist)
- Comment out decord2 monkey-patch (user removed it for torchcodec testing)
- Add TODO on _PACKING_PATCH_MODULES about generality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Switch VLM neat packing example to MedPix-VQA dataset with 8k seqlen

Use HF dataset (mmoukouba/MedPix-VQA) instead of local mockdata to
demonstrate packed_sequence working with standard HF datasets.
Increase pack_size/max_length to 8192 for real image samples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: deduplicate robust_collate into make_robust_collate

Extract the duplicated collate retry logic from PreTokenizedDatasetWrapper
and RobustDatasetWrapper into a shared make_robust_collate() function in
collate_fns.py. Both classes now delegate to it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* refactor: move media I/O helpers from datasets.py to utils.py

Move _resolve_lmdb_image, _read_video_frames, _preload_media, and
_build_video_metadata to vlm/utils.py. These are generic media utilities
not tied to any specific dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: move random import to module level, allow pretokenize without packing

- Move `import random` from inside make_robust_collate to module-level
  import in collate_fns.py
- Read pretokenize/max_length from cfg_ps regardless of pack_size,
  enabling pretokenize-only mode without packing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* cleanup: remove verbose comments from packing recipe yamls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* feat: add Qwen3.5-4B VLM neat packing recipe

Tested with 8 GPUs, 8k pack_size, MedPix-VQA dataset.
Requires transformers >= 5.3.0 for Qwen3.5 support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: add qwen3_5 to packing patch modules and fix missing import

- Add transformers.models.qwen3_5.modeling_qwen3_5 to packing patch list
  so create_causal_mask is patched for Qwen3.5 dense models
- Fix _passthrough_create_causal_mask signature to accept both
  input_embeds and inputs_embeds (HF 5.3.0 uses inputs_embeds)
- Import _lmdb_env_cache from utils.py in datasets.py (missed in
  earlier media helpers refactor)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove LLM recipe from VLM data pipeline PR

This LLM recipe doesn't belong in the VLM packing PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix: update test imports after media helpers move to utils.py

Update test_datasets.py to import _read_video_frames and
_preload_media from vlm/utils.py instead of vlm/datasets.py.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add unit tests for packing, utils, and collate changes

New test files:
- test_utils.py: _resolve_lmdb_image (cache, missing key, RGB),
  _build_video_metadata (empty, no video, preserved fields)
- test_packing.py: get_seqlens_in_batch, get_unpad_data,
  _passthrough_create_causal_mask (both HF signatures),
  get_attn_implementation (backend vs HF config),
  configure_packing (noop for sdpa, patches FA2 modules)

Extended test_collate_fns.py:
- make_robust_collate (success, retry, max_retries exhausted)
- neat_packed_vlm_collater attn_implementation variants
  (2D mask for FA2, 4D for sdpa, fixed max_length, pixel_values concat)

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: lint errors and missing copyright headers

- ruff fix: remove unused imports (copy, BaseVideoProcessor, load_video,
  as_completed), unused variables (grid_idx, total_text_tokens,
  total_media_tokens), fix import ordering
- Add copyright headers to scripts/precompute_tokens.py and
  tests/test_meta_dataset_all.py

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: ruff format on all changed files

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename test_utils.py to avoid pytest collection conflict

tests/unit_tests/datasets/test_utils.py already exists; having
test_utils.py in the vlm/ subdirectory causes a module name collision.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: configure cfg_ds.get defaults in build_dataloader tests

MagicMock().get() returns a truthy MagicMock by default, which
incorrectly triggers the pretokenize path. Configure side_effect
to return proper defaults for packing-related keys.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: make packing mask patch safe for non-packed forward passes

_passthrough_create_causal_mask now checks whether the attention mask
is actually a packed mask (4D or indexed with values > 1) before
returning it as-is. For normal 2D masks (standard training), it
delegates to the original HF create_causal_mask, preventing test
pollution where the monkey-patch breaks non-packed Qwen2 tests.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: passthrough causal mask for FA2 to avoid breaking validation

The previous logic delegated all non-packed 2D masks to HF's
create_causal_mask, which produced a mask incompatible with
flash_attention_2 during validation. FA2 handles causal masking
internally, so always pass through. Delegation to HF is now
limited to non-FA2 backends (sdpa/eager) where it is needed.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review feedback from claude[bot]

- Fix wrong import: _resolve_lmdb_image lives in utils.py not datasets.py
- Assign unused sum() results to variables in dataset timing summary
- Fix fake_indices bug: _drop_overlong_samples now returns kept indices
  so callers can filter examples in sync with conversations

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: log actual processor type before falling back to default

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove unused sum() variables flagged by ruff F841

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add validation and max_steps to VLM packing recipes

- Add validation_dataset (MedPix-VQA) and validation_dataloader to
  qwen3_vl_4b and qwen3_vl_moe_30b recipes
- Add max_steps: 100 to both recipes
- Switch MoE recipe from mockdata to MedPix-VQA with pack_size 8192

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: enable checkpoint with safetensors in qwen3_vl_4b recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: CLI app and launching (#1406)

* refactor CLI app

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update tests

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add breaking changes log

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add deprecation messages

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update readme

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update docs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add launch

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* drop nemo-run

Root cause: nemo-run pins cryptography<43.0.0, but your project constrains cryptography>=46.0.5 (CVE fix). These are fundamentally incompatible, so uv lock can't resolve any extra that depends on nemo_run.
Fix:
Removed nemo_run from the cli extra -- it's now just ["pyyaml"]
Removed the standalone nemo-run extra entirely (it was also unresolvable)
Updated BREAKING_CHANGES.md, installation guide, and cluster guide to note that nemo-run should be installed separately (pip install nemo-run) if needed
The SLURM and k8s launchers don't need nemo-run at all (they shell out to sbatch/kubectl), and the NemoRunLauncher already does a runtime import check, so users who need it just install nemo-run directly.

* update examples and messages

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update uv lock

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* mention uv run am in docs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* lint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add tests

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update docs/launcher/cluster.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Update docs/launcher/cluster.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Update BREAKING_CHANGES.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Update docs/launcher/local-workstation.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Update docs/launcher/cluster.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* support torchrun

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* :latest

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* proc-per-node

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add recipe target

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update docs

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update uv lock

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* lint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update uv lock

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* add ability to external path and slurm example script

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix url

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ty

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* enable recipe to be a string in addition to fqn

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update yamls

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update cli/app.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update cli/app.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update examples/llm_finetune/finetune.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* add test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add missing recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify cli/app.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update secrets.baseline

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update BREAKING_CHANGES.md

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>

* Update checkpointing.md

* move cli inside nemo_automodel to avoid name collisions

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* deprecate slurm launcher; rely on user script

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add app.py shim

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update secrets

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update job launchers

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* deprecate k8s in favor of skypilot

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Update nemo_automodel/components/launcher/slurm/launcher.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update nemo_automodel/components/launcher/interactive.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* Update examples/llm_finetune/finetune.py

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* keep only slurm.sub and remove launcher

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add recipe field

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* exclude

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* consistency is key

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* adding missing recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

* feat: add missing recipe in yaml (#1642)

add missing recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* ci: Update run time for nemotron super ci (#1614)

Update run time for nemotron super ci

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* ci: Update mistral4 medpix ci run time (#1646)

Update mistral4 medpix ci run time

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Signed-off-by: David <user_davidoneil@dgx-B200-2.cm.cluster>
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: stanley1208 <stanley.mei08@gmail.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Adil <47084919+adil-a@users.noreply.github.com>
Co-authored-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Co-authored-by: Ronay Ak <ronaya@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Pranav Thombre <pthombre@nvidia.com>
Co-authored-by: David O'Neil <134946410+davidoneilai@users.noreply.github.com>
Co-authored-by: David <user_davidoneil@dgx-B200-2.cm.cluster>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: alexchiu <alexq@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: stanley1208 <54892792+stanley1208@users.noreply.github.com>
Co-authored-by: zhiqil <zhiqil@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: Andrew Chen <chenopis@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
akoumpa mentioned this pull request Apr 2, 2026