feature: Add robust token counting with padding exclusion #40416
Merged
ArthurZucker merged 6 commits into huggingface:main · Sep 11, 2025
Conversation
…ens_seen variable and kept bool for backward compatibility and added string also to ensure everything goes well and kept default as is. also robust test cases are created
…t and also solved code quality issue
Contributor (Author)
Hello, I made the changes and our feature test passes. I am now working on getting the CI checks to pass. My first commit succeeded in run_tests but failed code_quality, which I then fixed. In the 3rd, 4th, and 5th commits, however, I am getting inconsistent run_tests results: 3, 2, and 1 failures respectively. Could this be an environment issue, or something else?
Member
cc @SunMarc

Thank you all!
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request on Oct 2, 2025:
…e#40416) * created robust token counting by using existing include_num_input_tokens_seen variable and kept bool for backward compatibility and added string also to ensure everything goes well and kept default as is. also robust test cases are created * some codebase mismatched in my local and remote, committing to solve it and also solved code quality issue * ci: retrigger tests * another attempt to trigger CI for checks
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request on Oct 4, 2025 (same commit message as above).
Fixes #40401
This pull request improves the Trainer's input-token counting by adding an option to exclude padding tokens from the count. It does so by extending the functionality of the existing include_num_input_tokens_seen argument in TrainingArguments, preserving full backward compatibility.
What was the feature?
The goal was to give users more precise control over how input tokens are counted during training. This feature allows excluding padding tokens from the total count. This is useful for accurate logging and performance analysis, especially in tasks with variable sequence lengths.
What was done and why?
To implement this without introducing a new boolean flag, the following changes were made:
Updated Existing Parameter: The include_num_input_tokens_seen argument in TrainingArguments now accepts the string values "all" and "non_padding" in addition to boolean values. This gives clearer control while keeping full backward compatibility (True maps to "all", False maps to "no").
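The bool-to-string mapping described above can be sketched as a small normalization helper. This is illustrative only; the function name `normalize_token_counting_mode` is hypothetical and not part of the actual Trainer code:

```python
def normalize_token_counting_mode(value):
    """Map an include_num_input_tokens_seen value to a canonical string.

    Accepts the legacy booleans as well as the new string values;
    True maps to "all" and False to "no" for backward compatibility.
    """
    if value is True:
        return "all"
    if value is False:
        return "no"
    if value in ("all", "non_padding", "no"):
        return value
    raise ValueError(
        f"Unsupported value for include_num_input_tokens_seen: {value!r}"
    )
```

With this mapping, existing configurations that pass True or False keep their old behavior unchanged.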
Improved Counting Logic: The Trainer's token counting logic was made more reliable. When "non_padding" is selected, the Trainer now follows a prioritized approach:
It first tries to use attention_mask.sum() for the most accurate count of non-padded tokens.
If attention_mask is not available, it counts tokens where input_ids are not equal to the pad_token_id.
If neither method works, it counts all tokens and logs a warning to inform the user.
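The prioritized fallback above can be illustrated with a simplified, framework-free sketch. Plain Python lists stand in for tensors here, and `count_non_padding_tokens` and its warning text are illustrative, not the Trainer's actual code:

```python
import warnings

def count_non_padding_tokens(input_ids, attention_mask=None, pad_token_id=None):
    """Count input tokens in a batch, preferring the most reliable signal.

    input_ids: list of token-id lists (one per sequence).
    attention_mask: optional list of 0/1 lists matching input_ids.
    """
    # 1) Prefer the attention mask: its sum is exactly the non-padded count.
    if attention_mask is not None:
        return sum(sum(row) for row in attention_mask)
    # 2) Fall back to comparing input_ids against pad_token_id.
    if pad_token_id is not None:
        return sum(1 for row in input_ids for tok in row if tok != pad_token_id)
    # 3) Last resort: count everything and warn that padding is included.
    warnings.warn("Could not exclude padding tokens; counting all input tokens.")
    return sum(len(row) for row in input_ids)
```

The real implementation operates on tensors (e.g. `attention_mask.sum()`), but the priority order is the same.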
Testing:
To ensure the reliability of this feature, a thorough test suite has been added to tests/trainer/test_trainer.py. The new tests cover:
All token counting modes ("all", "non_padding", True, False).
The new fallback logic, with specific test cases for when attention_mask is present, when it is absent (falling back to pad_token_id), and when neither is available (testing the warning and fallback to counting all tokens).
Full backward compatibility.
I noticed that torch_dtype has been replaced by dtype (#39782), so I updated our files manually to avoid merge issues.
I also clicked the Update Branch button.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.