[Improvement] Update DistributedLengthGroupedSampler to allow customizing length function #43363

Open
Dogacel wants to merge 1 commit into huggingface:main from Dogacel:distributed-length-grouped-sampling-generalization

Dogacel commented Jan 20, 2026

What does this PR do?

  1. Allow passing a custom length_func to the DistributedLengthGroupedSampler to support length grouping of complex data.
  2. Allow passing mega_batch_mult for fine-grained control of internal batching.
  3. Add docstrings on how to use the sampler.
  4. Add unit tests: (1) using the sampler with regular data whose input_ids have the sequence length as their first dimension, and (2) using the sampler with a custom length function on custom data.
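
To illustrate items 1 and 2, here is a minimal, library-independent sketch of length-grouped ordering. It is not the sampler's actual code: `length_grouped_order`, `nested_length`, and their parameters are hypothetical names that only mirror the `length_func` and `mega_batch_mult` ideas described above, and plain nested lists stand in for tensors.

```python
import random


def length_grouped_order(dataset, batch_size, length_func=len, mega_batch_mult=4, seed=0):
    """Return dataset indices shuffled, then sorted by length inside each mega-batch.

    Hypothetical helper for illustration only; the real sampler may differ.
    """
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)

    # A mega-batch spans several regular batches; sorting inside it keeps
    # examples of similar length close together while preserving randomness.
    mega_batch_size = batch_size * mega_batch_mult
    ordered = []
    for start in range(0, len(indices), mega_batch_size):
        chunk = indices[start:start + mega_batch_size]
        chunk.sort(key=lambda i: length_func(dataset[i]), reverse=True)
        ordered.extend(chunk)
    return ordered


def nested_length(example):
    # Custom length function: input_ids is shaped (1, seq_len), so read the inner list.
    return len(example["input_ids"][0])


# Ten toy examples with sequence lengths 1..10, each wrapped in an outer dimension of 1.
dataset = [{"input_ids": [list(range(n))]} for n in range(1, 11)]
order = length_grouped_order(dataset, batch_size=2, length_func=nested_length, mega_batch_mult=2)
```

A smaller `mega_batch_mult` in this sketch keeps more randomness across batches, while a larger one groups lengths more tightly.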

Motivation: We had inputs shaped as {"input_ids": torch.Tensor(1, seq_len)}, so calling len(...) returned 1 instead of seq_len, which caused length grouping to fail.
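
A small sketch of the failure mode described in the motivation. Both helper names are hypothetical, and a nested list stands in for a tensor of shape (1, seq_len) so the snippet runs without torch; objects that expose a `.shape` attribute are handled through its last entry.

```python
def first_dim_length(example):
    # What a plain len(...) sees: the size of the first dimension.
    return len(example["input_ids"])


def last_dim_length(example):
    # Hypothetical custom length function: use the last dimension instead.
    ids = example["input_ids"]
    shape = getattr(ids, "shape", None)
    if shape is not None:
        # Tensor-like objects expose .shape; the last entry is seq_len here.
        return int(shape[-1])
    # Fallback for nested lists: descend to the innermost list.
    while ids and isinstance(ids[0], (list, tuple)):
        ids = ids[0]
    return len(ids)


example = {"input_ids": [[101, 7592, 2088, 102]]}  # shape (1, 4)
print(first_dim_length(example))  # 1
print(last_dim_length(example))   # 4
```

With a first-dimension length, every example appears to have length 1, so sorting by length has no effect on the ordering.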

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed.

@SunMarc @3outeille

Also, I would like to know if you are interested in adding documentation for this class to the website. I find it pretty useful, but the website never mentions it, and I found it by chance while reading the source and issues.

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43363&sha=0e9367

Dogacel (Author) commented Mar 26, 2026

Can I get some 👀 or guidance on how to move forward?

@SunMarc, @3outeille
