
Use the target tokenizer in text generation pipeline#16049

Closed
AmitMY wants to merge 1 commit into huggingface:main from AmitMY:translation-decoder

Conversation

@AmitMY
Contributor

@AmitMY AmitMY commented Mar 10, 2022

What does this PR do?

A subset of #15946
This PR addresses the issue with pipelines in decoding text of different target side tokenizer

This is split from the main PR because this affects the huggingface model hub and API, while the other code is applicable to run in a local copy

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Narsil
Contributor

Narsil commented Mar 10, 2022

Hi @AmitMY ,

Thanks for this PR! It would be nice to get it working with Marian correctly.

Unfortunately, this approach cannot work within pipelines.

.as_target_tokenizer() does not exist in most tokenizers, and therefore cannot be used (pipeline code is very generic). It also mutates internal state, which is likely to cause issues in threaded/multiprocessing contexts.

Here I can think of a better approach if we do indeed need two very distinct tokenizers (it seems to be the case, but the solution for #15946 might change the direction).

The way the pipeline is set up, we would need to define two tokenizers: a preprocess_tokenizer and a postprocess_tokenizer (or source/target). Each would be unique and used accordingly.

By default, since most models use the same tokenizer, we could set both variables to the same object, and that should solve the issue at hand.
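To illustrate the idea, here is a minimal sketch of the proposed shape. The class and tokenizer names (TranslationPipelineSketch, ToyTokenizer) are hypothetical, not the transformers API; the model forward pass is elided with an identity placeholder.

```python
class ToyTokenizer:
    """Minimal stand-in tokenizer with its own vocabulary (hypothetical)."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: w for w, i in vocab.items()}

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


class TranslationPipelineSketch:
    """Hypothetical pipeline carrying distinct pre/post tokenizers."""
    def __init__(self, tokenizer, target_tokenizer=None):
        self.preprocess_tokenizer = tokenizer
        # Default: both roles share one object, matching most models.
        self.postprocess_tokenizer = target_tokenizer or tokenizer

    def __call__(self, text):
        ids = self.preprocess_tokenizer.encode(text)
        out_ids = self._generate(ids)  # model forward pass, elided
        return self.postprocess_tokenizer.decode(out_ids)

    def _generate(self, ids):
        return ids  # placeholder: identity "model"


src = ToyTokenizer({"hello": 0, "world": 1})
tgt = ToyTokenizer({"hallo": 0, "welt": 1})
pipe = TranslationPipelineSketch(src, target_tokenizer=tgt)
print(pipe("hello world"))  # decoded with the target vocabulary: "hallo welt"
```

With no target_tokenizer argument, decoding falls back to the source tokenizer, so existing single-tokenizer models would be unaffected.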

Before jumping to coding this, I think we should fix the first issue at hand #16050.

Btw, in that issue you mention Mbart, but in the code you modify Marian.

And a note for readers: there's also an open issue to support fast Marian tokenizers #15982 which could end up being linked to this work.

@patil-suraj
Contributor

Not reviewing but commenting for information.

.as_target_tokenizer() does not exist in most tokenizers, and therefore cannot be used

@Narsil as_target_tokenizer is defined in PreTrainedTokenizerBase, so it should be available for all tokenizers. By default it's a no-op, cf.

@contextmanager
def as_target_tokenizer(self):
    """
    Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with
    sequence-to-sequence models that need a slightly different processing for the labels.
    """
    yield

By default since most models use the same tokenizer, we could just set both variables to the same object and that should solve the issue at hand.

Right now I think there are only two models that support two vocabs: FSMT and Marian (#15831), but for these two tokenizers it's not necessary to call as_target_tokenizer, since they already use the target tokenizer in .decode/batch_decode.
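To make the two points above concrete, here is a minimal sketch of the no-op default and of a stateful override, which is exactly the mutation concern raised earlier. The classes (BaseTokenizer, TwoVocabTokenizer) are hypothetical illustrations, not the transformers implementation.

```python
from contextlib import contextmanager


class BaseTokenizer:
    """Sketch of the base-class behaviour: a no-op context manager."""
    @contextmanager
    def as_target_tokenizer(self):
        yield  # default: nothing changes


class TwoVocabTokenizer(BaseTokenizer):
    """Sketch of a tokenizer that temporarily swaps internal state for
    target-side encoding; this mutation is not safe to share across
    threads or processes."""
    def __init__(self):
        self.current_vocab = "source"

    @contextmanager
    def as_target_tokenizer(self):
        self.current_vocab = "target"
        try:
            yield
        finally:
            self.current_vocab = "source"


tok = TwoVocabTokenizer()
with tok.as_target_tokenizer():
    print(tok.current_vocab)  # "target" inside the block
print(tok.current_vocab)      # restored to "source" afterwards
```

The try/finally guarantees the source vocabulary is restored even if encoding raises, but two threads entering the context concurrently would still race on current_vocab.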

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions Bot closed this Apr 25, 2022
4 participants