Use the target tokenizer in text generation pipeline #16049
AmitMY wants to merge 1 commit into huggingface:main
Conversation
Hi @AmitMY, thanks for this PR! It would be nice to get this working correctly with Marian. Unfortunately, this approach cannot work within pipelines.
I can think of a better approach if we do indeed need two very distinct tokenizers (it seems to be the case, but the solution for #15946 might change the direction). The way the pipeline is set up, we would need to define two tokenizers. By default, since most models use the same tokenizer, we could set both variables to the same object, which should solve the issue at hand. Before jumping into coding this, I think we should fix the first issue at hand, #16050. By the way, in that issue you mention MBart, but in the code you modify Marian. And a note for readers: there is also an open issue to support fast Marian tokenizers, #15982, which could end up being linked to this work.
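The two-variable idea above can be sketched as follows. This is a hypothetical illustration, not the actual `transformers` pipeline code: the class and attribute names (`SimplePipeline`, `target_tokenizer`) are assumptions, and the tokenizer/model stubs exist only to make the sketch self-contained.

```python
class DummyTokenizer:
    """Minimal stand-in for a real tokenizer (illustrative only)."""
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: t for t, i in vocab.items()}

    def __call__(self, text):
        # Encode whitespace-separated tokens to ids.
        return [self.vocab[t] for t in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


class SimplePipeline:
    """Sketch of a pipeline holding separate encode/decode tokenizers."""
    def __init__(self, model, tokenizer, target_tokenizer=None):
        self.model = model
        self.tokenizer = tokenizer  # encodes the source text
        # Default both variables to the same object, matching the common
        # case where a model uses a single tokenizer.
        self.target_tokenizer = target_tokenizer or tokenizer

    def __call__(self, text):
        ids = self.tokenizer(text)
        output_ids = self.model(ids)
        # Decode with the target-side tokenizer, which may differ for
        # models like Marian or FSMT that use two vocabularies.
        return self.target_tokenizer.decode(output_ids)
```

For a model with a single tokenizer, omitting `target_tokenizer` makes both attributes point at the same object, so existing behavior is unchanged.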
Not reviewing but commenting for information.
@Narsil, see transformers/src/transformers/tokenization_utils_base.py, lines 3405 to 3411 at commit e66743e.
Right now I think there are only two models that support two vocabs: FSMT and Marian (#15831), but for these two tokenizers it's not necessary to call
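To make the two-vocab case concrete, here is a hypothetical tokenizer loosely modelled on the FSMT/Marian situation described above: one vocabulary encodes source text, a separate one decodes generated ids, and a context manager temporarily switches the encoding side (e.g. for labels). The class and method names are illustrative assumptions, not the actual transformers API.

```python
from contextlib import contextmanager


class TwoVocabTokenizer:
    """Hypothetical tokenizer with distinct source and target vocabularies."""

    def __init__(self, src_vocab, tgt_vocab):
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self._current = src_vocab  # encode with the source vocab by default
        self._inv_tgt = {i: t for t, i in tgt_vocab.items()}

    def __call__(self, text):
        # Encode with whichever vocab is currently active.
        return [self._current[tok] for tok in text.split()]

    def decode(self, ids):
        # Decoding always uses the target vocab, so no switch is needed here.
        return " ".join(self._inv_tgt[i] for i in ids)

    @contextmanager
    def as_target(self):
        # Temporarily encode with the target vocab, then restore the source one.
        self._current = self.tgt_vocab
        try:
            yield self
        finally:
            self._current = self.src_vocab
```

The point of the sketch: because `decode` is hard-wired to the target vocab, a pipeline can decode correctly without any mode switching, which mirrors the observation that the switch is not needed for these two tokenizers when decoding.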
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
A subset of #15946
This PR addresses the issue where pipelines decode generated text with the source tokenizer even when the model uses a distinct target-side tokenizer.
It is split from the main PR because it affects the Hugging Face model hub and API, while the other code can be run in a local copy.