Fix tokenizer check script: safe dataset access, default checkpoints, and tested in dry-run mode by aijadugar · Pull Request #41698 · huggingface/transformers

aijadugar · 2025-10-17T15:56:29Z

What this PR does (summary)

This PR improves the scripts/check_tokenizers.py script in the following ways:

Safe dataset access:
- Handles dataset[i]["premise"] and dataset[i]["hypothesis"] safely for the English XNLI dataset.
- Avoids errors with dictionary-like data for multi-language datasets.
Default checkpoints per tokenizer:
- Provides safe default checkpoints to avoid class mismatch warnings.
- Skips tokenizers without a defined checkpoint.
Dry-run mode:
- Runs on only a small subset (5 samples) by default.
- Avoids excessive downloading and testing during development.
Counters reset per tokenizer:
- The total, perfect, imperfect, and wrong counters are reset for each tokenizer run.
- This keeps the reported comparison results accurate for each tokenizer.
Improved error handling:
- Skips tokenizers gracefully when dependencies are missing or errors occur.
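
The diff itself is not shown in this thread, so the following is only a minimal sketch of the behaviours listed above, not the PR's actual code. The checkpoint mapping, the `Counters` class, and the helpers `get_text_fields`, `check_tokenizer`, and `whitespace_pair` are all hypothetical names introduced for illustration, and the sketch collapses the script's perfect/imperfect/wrong split into a plain match/mismatch count. It also assumes that non-string field values (as in the multi-language data) are skipped rather than unpacked.

```python
from collections.abc import Mapping
from dataclasses import dataclass

# Hypothetical defaults: placeholder names, not the checkpoints used in the PR.
DEFAULT_CHECKPOINTS = {
    "ExampleTokenizerA": "example-org/checkpoint-a",
    "ExampleTokenizerB": None,  # no checkpoint defined, so this one is skipped
}


@dataclass
class Counters:
    total: int = 0
    perfect: int = 0
    wrong: int = 0


def get_text_fields(example):
    """Return the premise/hypothesis strings of one example, skipping anything else."""
    if not isinstance(example, Mapping):
        return []
    texts = []
    for key in ("premise", "hypothesis"):
        value = example.get(key)
        # Assumption: dict-like values (multi-language rows) are ignored here.
        if isinstance(value, str):
            texts.append(value)
    return texts


def check_tokenizer(name, load_pair, dataset, max_samples=5):
    """Compare two tokenizers on a small subset; return Counters, or None if skipped."""
    checkpoint = DEFAULT_CHECKPOINTS.get(name)
    if checkpoint is None:
        print(f"Skipping {name}: no default checkpoint defined")
        return None
    try:
        slow, fast = load_pair(checkpoint)
    except Exception as exc:  # broad on purpose: missing dependency, download error, ...
        print(f"Skipping {name}: {exc}")
        return None

    counters = Counters()  # fresh counters for every tokenizer run
    for index, example in enumerate(dataset):
        if index >= max_samples:  # dry run: stop after a handful of samples
            break
        for text in get_text_fields(example):
            counters.total += 1
            if slow(text) == fast(text):
                counters.perfect += 1
            else:
                counters.wrong += 1
    return counters


def whitespace_pair(checkpoint):
    """Stand-in loader returning two identical whitespace tokenizers."""
    return str.split, str.split


if __name__ == "__main__":
    rows = [
        {"premise": "A man is walking.", "hypothesis": "A person moves."},
        {"premise": "Only a premise here."},
    ]
    print(check_tokenizer("ExampleTokenizerA", whitespace_pair, rows))
```

Returning a fresh `Counters` object per call, rather than updating module-level variables, is one way to guarantee that results from one tokenizer never leak into the next.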

Example Output


aijadugar

Hello team, this PR is ready for review...

Rocketknight1 · 2025-10-20T13:04:02Z

We're not sure we're keeping that script, unfortunately! I don't want to start merging fixes to it right now.

aijadugar · 2025-10-20T13:32:26Z

OK @Rocketknight1, what should I do now?

Fix tokenizer check script: safe dataset access, default checkpoints,…

c51161b

… and tested in dry-run mode

aijadugar commented Oct 18, 2025


Merge branch 'main' into fix/check_tokenizers

d45b45a

This was referenced Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

aijadugar wants to merge 2 commits into huggingface:main from aijadugar:fix/check_tokenizers
