Add labels padding in tokenization_utils_base.py #8116
changjonathanc wants to merge 1 commit into huggingface:master
Conversation
Hi there! Thanks for your PR! I see a few problems with this approach.
I think the proper fix is to create an option in
Thanks for the reply! Considering that different problems may pad labels differently, I think maybe it's better to leave it as is and use this:

```python
from typing import Dict, List, Union

import torch
from transformers import DataCollatorWithPadding

class MyDataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = super().__call__(features)
        # add custom label padding here
        return batch
```

Just came up with this. 😃 Not sure if it works.
Just tried it; the above code does not work, because the error is raised inside `super().__call__()` itself, before the custom label padding can run. Therefore:

Maybe we will need a
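As a sketch of what such a label-padding step could look like (a hypothetical `pad_labels` helper, not part of transformers; the `-100` pad value is an assumption borrowed from the default `ignore_index` of PyTorch's `CrossEntropyLoss`):

```python
from typing import Any, Dict, List

def pad_labels(features: List[Dict[str, Any]], pad_value: int = -100) -> List[Dict[str, Any]]:
    """Pad each feature's 'labels' list to the longest one in the batch.

    -100 is an assumption here: it is the default ignore_index of PyTorch's
    CrossEntropyLoss, so padded positions would be skipped by the loss.
    """
    max_len = max(len(f["labels"]) for f in features)
    for f in features:
        f["labels"] = f["labels"] + [pad_value] * (max_len - len(f["labels"]))
    return features

# Two features whose labels differ in length
features = [
    {"input_ids": [1, 2, 3], "labels": [0, 1, 0]},
    {"input_ids": [1, 2], "labels": [0, 1]},
]
padded = pad_labels(features)
# Both label lists now have the batch's max length (3)
```

Running this step before handing the batch to the tokenizer's padding would avoid the ragged-labels problem discussed above.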
I think you should use the newly pushed `DataCollatorForTokenClassification` from #8274.
Very nice! I guess I will close this PR. |
What does this PR do?
This PR makes `tokenizer.pad()` also pad `'labels'`. I tried to use this:
transformers/src/transformers/data/data_collator.py, line 69 in 8065fea
But since `labels` is not padded, the result cannot be turned into a tensor:

```
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
```

This patch solves the problem.
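The failure mode can be illustrated without torch: a rectangular tensor needs every row to have the same length (plain-Python sketch; `to_rectangular` is a hypothetical stand-in for tensor creation, not a real API):

```python
def to_rectangular(rows):
    # Hypothetical stand-in for tensor creation: like torch.tensor, it
    # refuses ragged input where rows have unequal lengths.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("Unable to create tensor: rows have unequal lengths")
    return rows

padded = [[0, 1, 0], [0, 1, -100]]  # labels padded to a common length
ragged = [[0, 1, 0], [0, 1]]        # unpadded labels

ok = to_rectangular(padded)  # succeeds: all rows have length 3
try:
    to_rectangular(ragged)
    raised = False
except ValueError:
    raised = True
# raised is True: unpadded labels cannot form a batched tensor
```

This is why padding `input_ids` and `attention_mask` alone is not enough when the features also carry per-token `labels`.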
It seems logical to me that `tokenizer.pad()` should also pad `'labels'`.

This portion of code was last changed in #4015. @n1t0 @thomwolf @LysandreJik