
Ensure compatibility between models and datasets #402

Merged
jlamypoirier merged 12 commits into main from jlp/consistent_preprocessing on Dec 10, 2025

Conversation

@jlamypoirier (Collaborator) commented Dec 4, 2025

✨ Description

When specifying a dataset and a model in Fast-LLM, it's not currently possible to tell whether the two are compatible. This PR mitigates the issue by adding runtime checks: the dataset preparator saves the relevant preprocessing options with the dataset, and the memmap dataset compares them against the options required by the model, forwarded through the new `preprocessing` field of `SamplingData`. Additional benefits (illustrated in the sketch after this list):

  • Added flexibility in the dataset content: the dataset produces trivial loss masking spans and/or image patches when they are missing instead of crashing (see Text-only training of multimodal models, #403), with a warning in case that's unintended.
  • The dataset reads only what it needs, e.g. a dataset with images won't read them for text-only models.
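
A minimal sketch of this behavior, assuming illustrative names (`read_sample`, `record`) rather than Fast-LLM's actual API:

```python
import warnings

import numpy as np


def read_sample(record: dict, use_loss_masking_spans: bool, use_image_patches: bool) -> dict:
    """Read from one stored record only the fields the model needs."""
    sample = {"tokens": np.asarray(record["tokens"], dtype=np.int64)}

    if use_loss_masking_spans:
        if record.get("loss_masking_spans") is not None:
            sample["loss_masking_spans"] = np.asarray(record["loss_masking_spans"], dtype=np.int64)
        else:
            # No spans stored: produce trivial (empty) spans instead of crashing,
            # with a warning since that may be unintended.
            warnings.warn("Dataset has no loss masking spans; producing trivial spans.")
            sample["loss_masking_spans"] = np.empty((0, 2), dtype=np.int64)

    if use_image_patches:
        # Fall back to "no images" when the dataset has none; for a text-only
        # model this branch is never taken and image bytes are never read.
        sample["images"] = record.get("images", [])

    return sample
```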

This should be backward compatible: older datasets will just warn that compatibility can't be checked. (Not tested, though.)

So far this checks the vocab size, `use_loss_masking_spans`, `use_preference_spans`, `use_image_patches` and the patch shape. The actual tokenizer, the maximum image shape and the image special tokens are still missing, as checking them would require extra work and additional fields in the training config.
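
As a rough sketch of what the check amounts to (the field names follow the list above; everything else is illustrative, not Fast-LLM's actual API):

```python
import warnings
from dataclasses import dataclass


@dataclass
class PreprocessingOptions:
    # The options currently compared; the model's requirements reach the
    # dataset through the new `preprocessing` field of `SamplingData`.
    vocab_size: int
    use_loss_masking_spans: bool
    use_preference_spans: bool
    use_image_patches: bool
    patch_shape: tuple[int, int] | None = None


def check_compatibility(stored: PreprocessingOptions | None,
                        required: PreprocessingOptions) -> None:
    """Compare options saved with the dataset against those the model requires."""
    if stored is None:
        # Older datasets were prepared before these options were saved with them.
        warnings.warn("Dataset has no stored preprocessing options; cannot check compatibility.")
        return
    if stored.vocab_size != required.vocab_size:
        raise ValueError(f"Vocab size mismatch: dataset {stored.vocab_size}, model {required.vocab_size}.")
    if required.use_preference_spans and not stored.use_preference_spans:
        raise ValueError("Model requires preference spans but the dataset has none.")
    if required.use_image_patches and stored.use_image_patches and stored.patch_shape != required.patch_shape:
        raise ValueError(f"Patch shape mismatch: dataset {stored.patch_shape}, model {required.patch_shape}.")
```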

This also addresses multiple test failures from recent PRs.

And some maintenance:

  • Clean up some model configs (especially `skip_tests`) and mark the `hybrid_mamba` model config as unimportant
  • Remove some outdated TODOs

That leaves 5 failing tests:

FAILED tests/models/test_checkpoint.py::test_huggingface_model[hybrid_mamba_2]@dependency_group_2 - AttributeError: 'NoneType' object has no attribute 'ssm_states'
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-auto] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-fused] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-triton] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242
FAILED tests/layers/test_lm_head.py::test_lm_head[config_dict12-distributed_config_dict12-True-1-torch] - AssertionError: Rms diff too big (4.942e-01 > 1.000e-05) between tensors 0.9908757209777832 and 0.4966764450073242

The first one has been there for a long time; the remaining four concern reverse KL (#400) and are being investigated by @oleksost.

Base automatically changed from jlp/fix_2d_rotary to main on December 4, 2025 14:48
@jlamypoirier marked this pull request as ready for review on December 8, 2025 20:01
@tscholak (Collaborator) left a comment

LGTM

@oleksost (Contributor) left a comment

Looks good!
The 4 failing tests concerning reverse KL were fixed in #405.

@jlamypoirier merged commit 3b50720 into main on Dec 10, 2025
2 checks passed
@jlamypoirier deleted the jlp/consistent_preprocessing branch on December 10, 2025 01:33