
Ensure compatibility between models and datasets #402

Merged
jlamypoirier merged 12 commits into main from jlp/consistent_preprocessing on Dec 10, 2025

Conversation

@jlamypoirier (Collaborator) commented Dec 4, 2025

✨ Description

When specifying a dataset and a model in Fast-LLM, it's not currently possible to tell whether the two are compatible. This PR mitigates the issue by adding runtime checks: the dataset preparator saves the relevant preprocessing options with the dataset, and the memmap dataset compares them against the options required by the model, forwarded through the new `preprocessing` field of `SamplingData`. Additional benefits (illustrated in the sketch after this list):

  • Added flexibility in the dataset content: the dataset produces trivial loss masking spans and/or image patches when they are missing instead of crashing (see Text-only training of multimodal models, #403), with a warning in case that's unintended.
  • The dataset reads only what it needs, e.g. a dataset with images won't read them for text-only models.
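
A minimal sketch of this behavior, assuming illustrative names (`read_sample`, `record`) rather than Fast-LLM's actual API:

```python
import warnings

import numpy as np


def read_sample(record: dict, use_loss_masking_spans: bool, use_image_patches: bool) -> dict:
    """Read from one stored record only the fields the model needs."""
    sample = {"tokens": np.asarray(record["tokens"], dtype=np.int64)}

    if use_loss_masking_spans:
        if record.get("loss_masking_spans") is not None:
            sample["loss_masking_spans"] = np.asarray(record["loss_masking_spans"], dtype=np.int64)
        else:
            # No spans stored: produce trivial (empty) spans instead of crashing,
            # with a warning since that may be unintended.
            warnings.warn("Dataset has no loss masking spans; producing trivial spans.")
            sample["loss_masking_spans"] = np.empty((0, 2), dtype=np.int64)

    if use_image_patches:
        # Fall back to "no images" when the dataset has none; for a text-only
        # model this branch is never taken and image bytes are never read.
        sample["images"] = record.get("images", [])

    return sample
```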

This should be backward compatible: older datasets will just warn that compatibility can't be checked. (Not tested, though.)

So far this checks the vocab size, `use_loss_masking_spans`, `use_preference_spans`, `use_image_patches` and the patch shape. The actual tokenizer, the maximum image shape and the image special tokens are still missing, as checking them would require extra work and additional fields in the training config.
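
As a rough sketch of what the check amounts to (the field names follow the list above; everything else is illustrative, not Fast-LLM's actual API):

```python
import warnings
from dataclasses import dataclass


@dataclass
class PreprocessingOptions:
    # The options currently compared; the model's requirements reach the
    # dataset through the new `preprocessing` field of `SamplingData`.
    vocab_size: int
    use_loss_masking_spans: bool
    use_preference_spans: bool
    use_image_patches: bool
    patch_shape: tuple[int, int] | None = None


def check_compatibility(stored: PreprocessingOptions | None,
                        required: PreprocessingOptions) -> None:
    """Compare options saved with the dataset against those the model requires."""
    if stored is None:
        # Older datasets were prepared before these options were saved with them.
        warnings.warn("Dataset has no stored preprocessing options; cannot check compatibility.")
        return
    if stored.vocab_size != required.vocab_size:
        raise ValueError(f"Vocab size mismatch: dataset {stored.vocab_size}, model {required.vocab_size}.")
    if required.use_preference_spans and not stored.use_preference_spans:
        raise ValueError("Model requires preference spans but the dataset has none.")
    if required.use_image_patches and stored.use_image_patches and stored.patch_shape != required.patch_shape:
        raise ValueError(f"Patch shape mismatch: dataset {stored.patch_shape}, model {required.patch_shape}.")
```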

This also addresses multiple test failures from recent PRs.

And some maintenance:

  • Clean up some model configs (especially `skip_tests`) and mark the `hybrid_mamba` model config as unimportant
  • Remove some outdated TODOs

That leaves 5 failing tests:

FAILED tests/models/test_checkpoint.py::test_huggingface_model[hybrid_mamba_2]@dependency_group_2 - AttributeError: 'NoneType' object has no attribute 'ssm_states'
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-auto] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-fused] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-triton] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-torch] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242

The first one has been there for a long time; the remaining four concern reverse KL (#400) and are being investigated by @oleksost.

Base automatically changed from jlp/fix_2d_rotary to main on December 4, 2025 14:48
@jlamypoirier marked this pull request as ready for review on December 8, 2025 20:01
@tscholak (Collaborator) left a comment

LGTM

@oleksost (Contributor) left a comment

Looks good!
The 4 failing tests concerning reverse KL were fixed in #405.

@jlamypoirier merged commit 3b50720 into main on Dec 10, 2025
2 checks passed
@jlamypoirier deleted the jlp/consistent_preprocessing branch on December 10, 2025 01:33