[WIP] seq2seq example #3402
mgoldey wants to merge 14 commits into huggingface:clean_encoder_decocer_modeling from
Conversation
As a non-blocking question, I do note that a lot of the examples use argparse to parse comparatively long lists of arguments. I've maintained the extant style in this PR to avoid causing noise and confusion. Would it be acceptable if I broke with this style to use a JSON file to store all the arguments for an experiment?
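For what it's worth, a minimal sketch of the JSON approach might look like this (the function name, keys, and defaults here are hypothetical placeholders, not from the actual example):

```python
import json


def load_experiment_args(path):
    """Load experiment arguments from a JSON file, filling in defaults.

    The keys below are hypothetical stand-ins for whatever schema the
    example would actually define.
    """
    defaults = {"learning_rate": 5e-5, "num_train_epochs": 3}
    with open(path) as f:
        overrides = json.load(f)
    # Values from the JSON file win over the defaults, so an experiment
    # file only needs to list the arguments it changes.
    return {**defaults, **overrides}
```

An experiment could then be launched as something like `python run_seq2seq.py experiment.json` instead of a long argparse command line.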
That sounds good. I'm still tweaking things on my end for accuracy and improved logic as I get more familiar with the code base here. I'll see if I can rebase off #3383 by then, depending on my other workload. Feel free to reach out via Google Hangouts if you're comfortable.
…mbeddings as input to decoder when no decoder_input_ids or decoder_inputs_embeds provided
8e3e8d4 to b782352
```python
features = []
...
cls_token, sep_token, pad_token = cls_token = (
    tokenizer.cls_token,
```
Hi, I've been using your example to guide my own setup of a seq2seq model (thank you!). Is this line in your pull request a typo? I think it's just supposed to be `cls_token, sep_token, pad_token = ( ...`
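For anyone puzzling over the difference: a chained assignment like that one also rebinds `cls_token` to the whole tuple, which is almost certainly not intended. A standalone sketch of the behavior (with a stub tokenizer, since the real one isn't needed to see it):

```python
# Minimal stand-in for the tokenizer attributes used in the example.
class Tok:
    cls_token, sep_token, pad_token = "[CLS]", "[SEP]", "[PAD]"

tokenizer = Tok()

# The line as written (chained assignment): Python assigns targets left
# to right, so after unpacking into the three names, cls_token is
# rebound to the whole tuple.
cls_token, sep_token, pad_token = cls_token = (
    tokenizer.cls_token,
    tokenizer.sep_token,
    tokenizer.pad_token,
)
print(cls_token)  # ('[CLS]', '[SEP]', '[PAD]') -- not '[CLS]'

# The presumably intended version: plain tuple unpacking.
cls_token, sep_token, pad_token = (
    tokenizer.cls_token,
    tokenizer.sep_token,
    tokenizer.pad_token,
)
print(cls_token)  # '[CLS]'
```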
```python
formatted_tokens += [pad_token_label_id]
segment_ids += [cls_token_segment_id]
# gpt2 has no cls_token
elif model_type not in ["gpt2"]:
```
I believe this is supposed to be

```python
elif model_type in ["gpt2"]:
```

instead of "not in".
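Whether `in` or `not in` is right depends on what the branch is meant to do, but the two conditions select opposite sets. A quick sketch (the model names here are chosen only for illustration):

```python
model_types = ["bert", "gpt2", "roberta"]

# `in ["gpt2"]` matches only gpt2 ...
gpt2_only = [m for m in model_types if m in ["gpt2"]]

# ... while `not in ["gpt2"]` matches everything except gpt2.
all_but_gpt2 = [m for m in model_types if m not in ["gpt2"]]

print(gpt2_only)     # ['gpt2']
print(all_but_gpt2)  # ['bert', 'roberta']
```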
Sorry to answer only now! I will soon add an Encoder-Decoder Google Colab that shows how to use seq2seq.
Thanks - fine to close. We've moved forward without using seq2seq due to poor overall accuracy with the scale of data in place.
This PR presents an example seq2seq use case and bug fixes necessary for this to execute with reasonable accuracy.
The utils_seq2seq.py file defines the data format for training data, and the run_seq2seq.py file takes training, development, and test data and produces a model. The README.md discusses how to execute this toy problem. The specific toy problem used here, formatting a date string to the American style, is deliberately trivial. On my local setup using GPUs, this example executes within 5 minutes. Production models would, of course, require substantially more data and training.
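As a sketch of the toy task (the exact source and target formats are my guess; utils_seq2seq.py defines the real data format), a training pair might map an ISO-style date string to the American style:

```python
from datetime import date


def make_date_pair(d):
    """Build one (source, target) pair for the toy reformatting task.

    Hypothetical formats: ISO "YYYY-MM-DD" in, American "MM/DD/YYYY" out.
    """
    return d.isoformat(), d.strftime("%m/%d/%Y")


print(make_date_pair(date(2020, 3, 14)))  # ('2020-03-14', '03/14/2020')
```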
I welcome feedback about how to strengthen performance here and the best route to increase testing.
This relies on a few bug fixes, which have been incorporated in this branch.
I strongly suspect that the input to the decoder in the PreTrainedEncoderDecoder class is incorrect as it stands in the code base, and commit 9fcf73a has a proposed fix. It doesn't make sense to feed the expected token ids to the decoder when the decoder needs to learn how to decode from the embeddings. (Edit: incomplete understanding - will fix.)