[Whisper] Add conversion script for the tokenizer #27338
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from d1c25fa to deb624a
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
> This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The TensorFlow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
> The original code can be found [here](https://github.com/openai/whisper).
sanchit-gandhi left a comment
Thanks for the speedy support @ArthurZucker!
```python
for bpe_tokens in merges:
    writer.write(bpe_tokens + "\n")

hf_tokenizer = WhisperTokenizer(vocab_file, merge_file)
```
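For context, the excerpt above writes the merges file one rule per line. A minimal, self-contained sketch of producing the `vocab.json` / `merges.txt` pair that a slow BPE tokenizer expects might look like this (toy vocabulary and merge rules, not the real Whisper data):

```python
import json
import os
import tempfile

# Toy BPE vocabulary and merge rules -- purely illustrative;
# the real Whisper vocab has tens of thousands of entries.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6}
merges = ["h e", "l l", "he ll"]

out_dir = tempfile.mkdtemp()
vocab_file = os.path.join(out_dir, "vocab.json")
merge_file = os.path.join(out_dir, "merges.txt")

# vocab.json: token -> id mapping
with open(vocab_file, "w", encoding="utf-8") as writer:
    json.dump(vocab, writer, ensure_ascii=False)

# merges.txt: header line, then one merge rule per line in priority order
with open(merge_file, "w", encoding="utf-8") as writer:
    writer.write("#version: 0.2\n")
    for bpe_tokens in merges:
        writer.write(bpe_tokens + "\n")
```

These two files are exactly what `WhisperTokenizer(vocab_file, merge_file)` consumes in the script.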
Do we need to convert the fast tokenizer as well? Or all good with just the slow?
The fast tokenizer can always be converted from the slow one when loading with `AutoTokenizer`, so I'd say there's no need, but I can add a comment.
```python
)
args = parser.parse_args()

if args.convert_tokenizer:
```
To me it's more intuitive to always convert the tokenizer, since we can't use the model without it
Yes, but that's not backward compatible, because it requires `tiktoken`.
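One way to keep the script backward compatible is to check for the optional dependency lazily and fail with a clear message only when tokenizer conversion is actually requested. A hedged sketch (the helper name `require_package` is made up for illustration):

```python
import importlib.util

def require_package(name: str) -> None:
    # Hypothetical helper: raise only when the optional dependency
    # is actually needed, so weight conversion still works without it.
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"Converting the tokenizer requires the `{name}` package; "
            f"install it with `pip install {name}`."
        )
```

Calling `require_package("tiktoken")` only inside the `--convert_tokenizer` branch would let users without `tiktoken` keep converting model weights as before.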
```python
else:
    from tiktoken.load import load_tiktoken_bpe

NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}
```
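For background, the gist credited in the commit list recovers merge rules from a `tiktoken`-style rank table by re-running BPE on each multi-byte token, capped at that token's own rank. A self-contained sketch with a toy rank table (not the real Whisper ranks):

```python
def bpe(mergeable_ranks, token, max_rank=None):
    """Greedily merge the lowest-ranked adjacent pair, as byte-level BPE does."""
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts

# Toy rank table: the 256 single bytes plus three learned merges.
ranks = {bytes([i]): i for i in range(256)}
ranks.update({b"he": 256, b"ll": 257, b"hell": 258})

# Recover the merge list: re-run BPE on each multi-byte token,
# stopping just before the rank that created it, so it splits
# back into exactly the two tokens that were merged.
merges = []
for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    if len(token) == 1:
        continue
    pair = bpe(ranks, token, max_rank=rank)
    merges.append(b" ".join(pair))
```

Each recovered pair becomes one line of `merges.txt`, which is what makes the `tiktoken` ranks loadable by a slow `transformers` tokenizer.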
These could be fetched from the model metadata, no? Rather than having the user input them?
I decided not to use the full model's data to keep it separate; otherwise I'd have to either change the conversion function with a new argument or fetch the full tokenizer, which requires `whisper` (the package). I think this is simpler.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
* draft
* updates
* full conversion taken from `https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee`
* psuh
* nits
* updates
* more nits
* Add co author
  Co-authored-by: Joshua Lochner <admin@xenova.com>
* fixup
* cleanup
* styling
* add proper path
* update
* nits
* don't push the exit
* clean
* update whisper doc
* don't error out if tiktoken is not here
* make sure we are BC with conversion
* nit
* Update docs/source/en/model_doc/whisper.md
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* merge and update
* update markdwon
* Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Aligned with #27336, this PR adds the conversion of the tokenizer from `tiktoken` to `transformers`.
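Since `tiktoken` stores tokens as raw bytes while GPT-2-style slow tokenizers store them as printable strings in `vocab.json`, the conversion also needs the standard GPT-2 byte-to-unicode mapping. A standalone sketch of that well-known construction (reproduced here for illustration; `token_bytes_to_string` is a hypothetical helper name):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    so byte-level BPE tokens can live in plain-text vocab/merges files.
    This is the standard GPT-2 construction."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes get shifted into the private range above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

def token_bytes_to_string(token: bytes) -> str:
    # e.g. a leading space byte is rendered as "Ġ", as in GPT-2 vocab files
    return "".join(byte_encoder[b] for b in token)
```

Applying this mapping to every `tiktoken` token yields the string keys written into `vocab.json` and the two halves of each line in `merges.txt`.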