[Whisper] Add conversion script for the tokenizer #27338
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from d1c25fa to deb624a
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
> This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The TensorFlow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
> The original code can be found [here](https://github.com/openai/whisper).
sanchit-gandhi left a comment
Thanks for the speedy support @ArthurZucker!
```python
for bpe_tokens in merges:
    writer.write(bpe_tokens + "\n")

hf_tokenizer = WhisperTokenizer(vocab_file, merge_file)
```
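For context, the excerpt above writes the merges file one rule per line. A minimal, self-contained sketch of producing the `vocab.json` / `merges.txt` pair that a slow BPE tokenizer expects might look like this (toy vocabulary and merge rules, not the real Whisper data):

```python
import json
import os
import tempfile

# Toy BPE vocabulary and merge rules -- purely illustrative;
# the real Whisper vocab has tens of thousands of entries.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6}
merges = ["h e", "l l", "he ll"]

out_dir = tempfile.mkdtemp()
vocab_file = os.path.join(out_dir, "vocab.json")
merge_file = os.path.join(out_dir, "merges.txt")

# vocab.json: token -> id mapping
with open(vocab_file, "w", encoding="utf-8") as writer:
    json.dump(vocab, writer, ensure_ascii=False)

# merges.txt: header line, then one merge rule per line in priority order
with open(merge_file, "w", encoding="utf-8") as writer:
    writer.write("#version: 0.2\n")
    for bpe_tokens in merges:
        writer.write(bpe_tokens + "\n")
```

These two files are exactly what `WhisperTokenizer(vocab_file, merge_file)` consumes in the script.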
Do we need to convert the fast tokenizer as well? Or all good with just the slow?
The fast tokenizer can always be converted from the slow one when loading with `AutoTokenizer`, so I'd say there's no need, but I can add a comment.
```python
)
args = parser.parse_args()

if args.convert_tokenizer:
```
To me it's more intuitive to always convert the tokenizer, since we can't use the model without it
Yes, but that's not backward compatible, because it requires `tiktoken`.
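One way to keep the script backward compatible is to check for the optional dependency lazily and fail with a clear message only when tokenizer conversion is actually requested. A hedged sketch (the helper name `require_package` is made up for illustration):

```python
import importlib.util

def require_package(name: str) -> None:
    # Hypothetical helper: raise only when the optional dependency
    # is actually needed, so weight conversion still works without it.
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"Converting the tokenizer requires the `{name}` package; "
            f"install it with `pip install {name}`."
        )
```

Calling `require_package("tiktoken")` only inside the `--convert_tokenizer` branch would let users without `tiktoken` keep converting model weights as before.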
```python
else:
    from tiktoken.load import load_tiktoken_bpe

NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}
```
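For background, the gist credited in the commit list recovers merge rules from a `tiktoken`-style rank table by re-running BPE on each multi-byte token, capped at that token's own rank. A self-contained sketch with a toy rank table (not the real Whisper ranks):

```python
def bpe(mergeable_ranks, token, max_rank=None):
    """Greedily merge the lowest-ranked adjacent pair, as byte-level BPE does."""
    parts = [bytes([b]) for b in token]
    while True:
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts

# Toy rank table: the 256 single bytes plus three learned merges.
ranks = {bytes([i]): i for i in range(256)}
ranks.update({b"he": 256, b"ll": 257, b"hell": 258})

# Recover the merge list: re-run BPE on each multi-byte token,
# stopping just before the rank that created it, so it splits
# back into exactly the two tokens that were merged.
merges = []
for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    if len(token) == 1:
        continue
    pair = bpe(ranks, token, max_rank=rank)
    merges.append(b" ".join(pair))
```

Each recovered pair becomes one line of `merges.txt`, which is what makes the `tiktoken` ranks loadable by a slow `transformers` tokenizer.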
These could be fetched from the model metadata, no? Rather than having the user input them?
I decided not to use the full model's data to keep it separate; otherwise I'd have to either change the conversion function with a new argument or fetch the full tokenizer, which requires `whisper` (the package). I think this is simpler.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
* draft
* updates
* full conversion taken from `https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee`
* psuh
* nits
* updates
* more nits
* Add co author
  Co-authored-by: Joshua Lochner <admin@xenova.com>
* fixup
* cleanup
* styling
* add proper path
* update
* nits
* don't push the exit
* clean
* update whisper doc
* don't error out if tiktoken is not here
* make sure we are BC with conversion
* nit
* Update docs/source/en/model_doc/whisper.md
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* merge and update
* update markdwon
* Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
Aligned with #27336, this PR adds the conversion of the tokenizer from `tiktoken` to `transformers`.
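Since `tiktoken` stores tokens as raw bytes while GPT-2-style slow tokenizers store them as printable strings in `vocab.json`, the conversion also needs the standard GPT-2 byte-to-unicode mapping. A standalone sketch of that well-known construction (reproduced here for illustration; `token_bytes_to_string` is a hypothetical helper name):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    so byte-level BPE tokens can live in plain-text vocab/merges files.
    This is the standard GPT-2 construction."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes get shifted into the private range above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

def token_bytes_to_string(token: bytes) -> str:
    # e.g. a leading space byte is rendered as "Ġ", as in GPT-2 vocab files
    return "".join(byte_encoder[b] for b in token)
```

Applying this mapping to every `tiktoken` token yields the string keys written into `vocab.json` and the two halves of each line in `merges.txt`.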