
Use the target tokenizer in text generation pipeline#16049

Closed
AmitMY wants to merge 1 commit into huggingface:main from AmitMY:translation-decoder

Conversation

@AmitMY
Contributor

@AmitMY AmitMY commented Mar 10, 2022

What does this PR do?

A subset of #15946
This PR addresses the issue with pipelines in decoding text of different target side tokenizer

This is split from the main PR because this affects the huggingface model hub and API, while the other code is applicable to run in a local copy

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@Narsil
Contributor

Narsil commented Mar 10, 2022

Hi @AmitMY ,

Thanks for this PR! It would be nice to get it working with Marian correctly.

Unfortunately, this approach cannot work within pipelines.

.as_target_tokenizer() does not exist in most tokenizers, and therefore cannot be used (pipeline code is very generic). It also mutates internal state, which is likely to cause issues in threaded/multiprocessing contexts.

Here I can think of a better approach if we do indeed need two very distinct tokenizers (it seems to be the case, but the solution for #15946 might change the direction).

The way the pipeline is set up, we would need to define two tokenizers: a preprocess_tokenizer and a postprocess_tokenizer (or source/target). Each would be unique and used accordingly.

By default, since most models use the same tokenizer, we could set both variables to the same object, and that should solve the issue at hand.
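To illustrate the idea, here is a minimal sketch of the proposed shape. The class and tokenizer names (TranslationPipelineSketch, ToyTokenizer) are hypothetical, not the transformers API; the model forward pass is elided with an identity placeholder.

```python
class ToyTokenizer:
    """Minimal stand-in tokenizer with its own vocabulary (hypothetical)."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: w for w, i in vocab.items()}

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


class TranslationPipelineSketch:
    """Hypothetical pipeline carrying distinct pre/post tokenizers."""
    def __init__(self, tokenizer, target_tokenizer=None):
        self.preprocess_tokenizer = tokenizer
        # Default: both roles share one object, matching most models.
        self.postprocess_tokenizer = target_tokenizer or tokenizer

    def __call__(self, text):
        ids = self.preprocess_tokenizer.encode(text)
        out_ids = self._generate(ids)  # model forward pass, elided
        return self.postprocess_tokenizer.decode(out_ids)

    def _generate(self, ids):
        return ids  # placeholder: identity "model"


src = ToyTokenizer({"hello": 0, "world": 1})
tgt = ToyTokenizer({"hallo": 0, "welt": 1})
pipe = TranslationPipelineSketch(src, target_tokenizer=tgt)
print(pipe("hello world"))  # decoded with the target vocabulary: "hallo welt"
```

With no target_tokenizer argument, decoding falls back to the source tokenizer, so existing single-tokenizer models would be unaffected.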

Before jumping to coding this, I think we should fix the first issue at hand #16050.

Btw, in that issue you mention Mbart, but in the code you modify Marian.

And a note for readers: there's also an open issue to support fast Marian tokenizers #15982 which could end up being linked to this work.

@patil-suraj
Contributor

Not reviewing but commenting for information.

.as_target_tokenizer() does not exist in most tokenizers, and therefore cannot be used

@Narsil as_target_tokenizer is defined in PreTrainedTokenizerBase, so it should be available for all tokenizers. By default it's a no-op, cf.

@contextmanager
def as_target_tokenizer(self):
    """
    Temporarily sets the tokenizer for encoding the targets. Useful for tokenizers associated with
    sequence-to-sequence models that need a slightly different processing for the labels.
    """
    yield

By default since most models use the same tokenizer, we could just set both variables to the same object and that should solve the issue at hand.

Right now I think there are only two models that support two vocabs: FSMT and Marian (#15831), but for these two tokenizers it's not necessary to call as_target_tokenizer, since they already use the target tokenizer in .decode/batch_decode.
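To make the two points above concrete, here is a minimal sketch of the no-op default and of a stateful override, which is exactly the mutation concern raised earlier. The classes (BaseTokenizer, TwoVocabTokenizer) are hypothetical illustrations, not the transformers implementation.

```python
from contextlib import contextmanager


class BaseTokenizer:
    """Sketch of the base-class behaviour: a no-op context manager."""
    @contextmanager
    def as_target_tokenizer(self):
        yield  # default: nothing changes


class TwoVocabTokenizer(BaseTokenizer):
    """Sketch of a tokenizer that temporarily swaps internal state for
    target-side encoding; this mutation is not safe to share across
    threads or processes."""
    def __init__(self):
        self.current_vocab = "source"

    @contextmanager
    def as_target_tokenizer(self):
        self.current_vocab = "target"
        try:
            yield
        finally:
            self.current_vocab = "source"


tok = TwoVocabTokenizer()
with tok.as_target_tokenizer():
    print(tok.current_vocab)  # "target" inside the block
print(tok.current_vocab)      # restored to "source" afterwards
```

The try/finally guarantees the source vocabulary is restored even if encoding raises, but two threads entering the context concurrently would still race on current_vocab.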

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions Bot closed this Apr 25, 2022
4 participants