Fix tokenizer check script: safe dataset access, default checkpoints, and tested in dry-run mode #41698

Open
aijadugar wants to merge 2 commits into huggingface:main from aijadugar:fix/check_tokenizers

@aijadugar

What this PR does

This PR improves the scripts/check_tokenizers.py script in five ways:

  1. Safe dataset access:

    • Reads dataset[i]["premise"] and dataset[i]["hypothesis"] safely for the English XNLI dataset.
    • Avoids errors when multi-language datasets return dictionary-like values for these fields.
  2. Default checkpoints per tokenizer:

    • Provides safe default checkpoints to avoid class mismatch warnings.
    • Skips tokenizers without a defined checkpoint.
  3. Dry-run by default:

    • Runs on a small subset (5 samples) by default.
    • Avoids excessive downloading and long test runs during development.
  4. Counters reset per tokenizer:

    • The total, perfect, imperfect, and wrong counters are reset for each tokenizer run.
    • Results are therefore reported accurately for each tokenizer.
  5. Improved error handling:

    • Skips tokenizers gracefully when dependencies are missing or errors occur.
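
The five points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the checkpoint names, the `extract_texts`, `check_tokenizer`, and `run_checks` helpers, the loader interface, the assumed multi-language field shapes, and the reading of "imperfect" as "different ids, same decoded text" are all hypothetical. Toy callables stand in for real tokenizers so the sketch runs without downloads.

```python
from itertools import islice

# Hypothetical defaults for illustration; the PR's real mapping is not shown here.
DEFAULT_CHECKPOINTS = {
    "ToyTokenizer": "example-org/toy-checkpoint",
    "NoDefaultTokenizer": None,
}


def extract_texts(example):
    """Collect premise/hypothesis strings, tolerating plain and dict-like fields."""
    texts = []
    for key in ("premise", "hypothesis"):
        value = example.get(key)
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, dict):
            # Assumed multi-language shapes: {"en": "..."} or {"translation": [...]}.
            candidates = value.get("translation", list(value.values()))
            texts.extend(t for t in candidates if isinstance(t, str))
    return texts


def check_tokenizer(slow_encode, fast_encode, decode, examples, max_samples=5):
    """Compare two encoders on at most max_samples examples."""
    # Fresh counters on every call, so results never leak between tokenizers.
    counts = {"total": 0, "perfect": 0, "imperfect": 0, "wrong": 0}
    for example in islice(examples, max_samples):
        for text in extract_texts(example):
            counts["total"] += 1
            slow_ids, fast_ids = slow_encode(text), fast_encode(text)
            if slow_ids == fast_ids:
                counts["perfect"] += 1
            elif decode(slow_ids) == decode(fast_ids):
                counts["imperfect"] += 1  # assumed meaning: ids differ, text matches
            else:
                counts["wrong"] += 1
    return counts


def run_checks(loaders, examples, max_samples=5):
    """Run the comparison for each tokenizer, skipping ones that cannot be loaded."""
    results = {}
    for name, load in loaders.items():
        checkpoint = DEFAULT_CHECKPOINTS.get(name)
        if checkpoint is None:
            print(f"Skipping {name}: no default checkpoint defined")
            continue
        try:
            slow_encode, fast_encode, decode = load(checkpoint)
        except (ImportError, OSError, ValueError) as err:
            print(f"Skipping {name}: {err}")
            continue
        results[name] = check_tokenizer(
            slow_encode, fast_encode, decode, examples, max_samples
        )
    return results
```

Here each loader is a hypothetical callable that takes a checkpoint name and returns `(slow_encode, fast_encode, decode)`. Using `islice` keeps the dry-run limit independent of the container type, so the same loop works on a plain list of dicts or on any iterable that yields one example at a time.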

Example Output

image

Author

@aijadugar aijadugar left a comment

Hello team, this PR is ready for review.

@Rocketknight1
Member

We're not sure we're keeping that script, unfortunately! I don't want to start merging fixes to it right now.

@aijadugar
Author

OK @Rocketknight1, what should I do now?

2 participants