[examples/seq2seq] support label smoothing #9844
patil-suraj merged 6 commits into huggingface:master
Conversation
sgugger
left a comment
Thanks a lot for doing this! I like it a lot!
I don't know if the shift methods are used for something else in the seq2seq models, but if this was their only use, we could maybe deprecate them?
Those are used for exactly the same reason.
patrickvonplaten
left a comment
LGTM!
One thing I'd change, however, would be to not allow passing tokenizer.pad_token_id to prepare_decoder_input_ids_from_labels => I think the model should always have its own pad_token_id defined in the config. What do you think?
I agree we could remove the
Just realized this: here labels is a list, but we need a tensor for the prepare_decoder_input_ids_from_labels method. And we can't turn it into a tensor here since it's not padded.
Thinking more about this, I would say we remove the ignore_pad_token_for_loss argument; the name is rather confusing. In the previous script, it was used to specify that pad tokens should be ignored as-is (by setting the loss's ignore_index to pad_token_id) rather than replacing them with -100 (because of the label smoothing issues). So in either case the pad tokens were ignored, and I don't think there is any reason not to ignore them.
So IMO we should:
- remove the ignore_pad_token_for_loss argument
- set padding to True when pad_to_max_length is False
- replace pad with -100 in the pre-processing function
- remove/deprecate DataCollatorForSeq2Seq
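A minimal sketch of the "replace pad with -100" step proposed above, assuming the tokenized labels come back as plain lists of ids padded with the tokenizer's pad_token_id (the helper name `mask_pad_tokens` is hypothetical, not from the PR):

```python
# Hypothetical helper: replace pad tokens in a batch of label id lists
# with -100 so the cross-entropy loss ignores those positions.
def mask_pad_tokens(labels, pad_token_id):
    """Return labels with every pad_token_id swapped for -100."""
    return [
        [tok if tok != pad_token_id else -100 for tok in label]
        for label in labels
    ]

# e.g. with pad_token_id = 0:
# mask_pad_tokens([[5, 7, 0, 0]], 0) -> [[5, 7, -100, -100]]
```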
> set padding to True when pad_to_max_length is False

This will pad to the maximum length of the batches sent to the map method of the dataset, not the maximum length of a training batch, so this does not work. DataCollatorForSeq2Seq is needed until we have the datasets v2 release to do that padding as a transform on the dataset (soon!), so we can't deprecate it just now.
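A toy illustration of the point above, assuming `map` processes examples in fixed chunks (e.g. the default batch_size of 1000): padding inside the preprocessing function pads to the longest sequence in the map chunk, which generally differs from the DataLoader batches seen at train time. The function name `pad_chunk` is illustrative, not from the library:

```python
# Illustrative only: what padding=True effectively does within one
# chunk of examples handed to datasets' map(batched=True).
def pad_chunk(sequences, pad_id=0):
    """Pad all sequences in this chunk to the chunk's own max length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

# One map chunk containing a long outlier pads every row to its length,
# even if a training-time batch only ever contains the short rows.
```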
Aah, yes. Then do you think we could pass the prepare_decoder_input_ids_from_labels method to DataCollatorForSeq2Seq and prepare the decoder_input_ids there?
I'm happy with that solution, yes. We can pass the model to DataCollatorForSeq2Seq and do the check for the method there (better to pass the object than a function).
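A hedged sketch of the design agreed on above: pass the model object to the collator and, if it exposes prepare_decoder_input_ids_from_labels, use it to build decoder_input_ids from the labels. The class below is a simplified stand-in, not the real DataCollatorForSeq2Seq (which also handles padding):

```python
# Simplified stand-in for the agreed design: the collator receives the
# model object and checks for the method, rather than receiving a bare
# function.
class SimpleSeq2SeqCollator:
    def __init__(self, model=None):
        self.model = model

    def __call__(self, batch):
        # Only prepare decoder_input_ids when the model supports it.
        if self.model is not None and hasattr(
            self.model, "prepare_decoder_input_ids_from_labels"
        ):
            batch["decoder_input_ids"] = (
                self.model.prepare_decoder_input_ids_from_labels(
                    labels=batch["labels"]
                )
            )
        return batch
```

Passing the model (rather than a function) keeps the check in one place and lets the collator degrade gracefully for models without the method.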
Hey @sgugger , @patil-suraj
I am trying to run "run_seq2seq.py" to train mT5 for a translation task, but I am getting the following error:
ImportError: cannot import name 'DataCollatorForSeq2Seq' from 'transformers' (unknown location)
It seems that DataCollatorForSeq2Seq is already removed from the transformers package, right?
P.S. my transformers version: 4.2.2 installed using pip
Hey @Arman-IMRSV,
Please use issues to report bugs.
force-pushed from 2984ca4 to bc9fee8
sgugger
left a comment
Thanks a lot for the work here!
What does this PR do?
Adds support for label smoothing by adding a prepare_decoder_input_ids_from_labels method to all seq2seq models, which will let us prepare decoder_input_ids outside the model.
For context, we need to pass decoder_input_ids for label smoothing because we don't pass labels, to avoid calculating the loss twice, which leads to speed degradation, see #9713.
@sgugger, @patrickvonplaten what do we think about adding prepare_decoder_input_ids_from_labels to every seq2seq model? There are already shift_tokens_right/_shift_right methods, but the name is a bit confusing IMO to use outside the model.
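A hypothetical sketch of what such a method could do for a typical seq2seq model: shift the labels one position to the right, prepend the decoder start token, and replace any -100 (ignored-loss) positions with the pad token so the decoder never sees -100. In a real model, `decoder_start_token_id` and `pad_token_id` would come from the model's config rather than being passed in:

```python
# Hypothetical standalone version; real models would read the two token
# ids from their config instead of taking them as arguments.
def prepare_decoder_input_ids_from_labels(
    labels, decoder_start_token_id, pad_token_id
):
    """Shift labels right and sanitize -100 entries for the decoder."""
    # Drop the last label, prepend the decoder start token.
    shifted = [[decoder_start_token_id] + label[:-1] for label in labels]
    # The decoder must never receive -100; use the pad token instead.
    return [
        [tok if tok != -100 else pad_token_id for tok in row]
        for row in shifted
    ]

# prepare_decoder_input_ids_from_labels([[5, 7, -100]], 0, 1)
# -> [[0, 5, 7]]
```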