Fix for sequences of unequal length #20
base: main
Conversation
Thanks @fschlatt. Do you have any effectiveness numbers on msmarco-passage (before/after)?
The scores are actually marginally worse (k=100), using this notebook with slight modifications: https://github.com/terrier-org/pyterrier/blob/master/examples/experiments/msmarco_BM25_MonoT5.ipynb
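For reference, a rough sketch of how such a before/after comparison can be run with PyTerrier's Experiment API, loosely following the linked notebook. The dataset variants, pipeline wiring, and metrics below are assumptions for illustration, not the exact setup that produced the numbers above.

```python
# Hedged sketch: compare plain BM25 against BM25 >> MonoT5 re-ranking of the
# top 100 results (k=100, as above) on TREC DL 2019 passage judgments.
# Dataset/variant names and metrics are assumptions, not the original setup.
import pyterrier as pt
from pyterrier_t5 import MonoT5ReRanker

if not pt.started():
    pt.init()

text_source = pt.get_dataset("irds:msmarco-passage")  # provides passage text
bm25 = pt.BatchRetrieve.from_dataset(
    "msmarco_passage", "terrier_stemmed", wmodel="BM25"
)
# Re-rank the top 100 BM25 candidates with the MonoT5 cross-encoder.
mono_t5 = bm25 % 100 >> pt.text.get_text(text_source, "text") >> MonoT5ReRanker()

dl19 = pt.get_dataset("irds:msmarco-passage/trec-dl-2019/judged")
pt.Experiment(
    [bm25, mono_t5],
    dl19.get_topics(),
    dl19.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10"],
    names=["BM25", "BM25 >> MonoT5"],
)
```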
Since the attention mask is still fine, I guess the only problem is the additional EOS token that gets added, which doesn't seem to throw the model off too much.
Come to think of it, left-side padding might also not be optimal, due to the positional encodings of T5.
So this could be related to the training of the MonoT5 model? We're using the original checkpoint of Jimmy et al. here.
I may have been a bit too brief in explaining the issue and the solution, so I'll try to explain it a bit better. If I'm not mistaken, the pyterrier_t5 MonoT5 implementation tries to handle the case where the sequence is too long for the model, which could cause the " ... Relevant:" suffix of the prompt to be cut off. To avoid this, pyterrier_t5 pre-encodes the suffix and concatenates it after tokenizing the sequence without the suffix. This can create weird problems when padding tokens are inserted, i.e., when sequences of different lengths are tokenized in a single batch. (Apparently this is not a big problem in practice, since the scores on TREC DL seem fine nonetheless.)

Left-side padding might circumvent this issue by moving the padding tokens to the very left of the input sequences, so the suffix is correctly appended to the end of each sequence. However, left-side padding may mess with the positional encodings of the underlying T5 model; I am not entirely sure how T5 positional encodings work.

Looking at the original implementation by Lin et al. (https://github.com/castorini/pygaggle/blob/master/pygaggle/model/tokenize.py#L103), it seems that they do not handle over-long sequences at all. So the pyterrier_t5 implementation might as well skip the suffix pre-encoding step and just pass the "Query: {q} Document: {d} Relevant:" prompt to the tokenizer. Then the model scores should exactly match those of the original implementation, if I'm not mistaken.
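For concreteness, a minimal sketch of the simpler path suggested above: tokenizing the full prompt in one pass instead of pre-encoding the suffix. The checkpoint name, example texts, and truncation settings are illustrative assumptions, not the actual pyterrier_t5 code.

```python
# Minimal sketch (not the actual pyterrier_t5 code): build the full
# "Query: ... Document: ... Relevant:" prompt and tokenize it in one pass,
# as the original pygaggle implementation does.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")

query = "what is a lobster roll"  # illustrative query
docs = [
    "A lobster roll is a sandwich filled with lobster meat ...",   # illustrative
    "Lobsters are large marine crustaceans with long bodies ...",  # documents
]

prompts = [f"Query: {query} Document: {d} Relevant:" for d in docs]
enc = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,      # pad the batch to the longest prompt
    truncation=True,   # caveat: a very long document can truncate "Relevant:"
    max_length=512,
)
# enc.input_ids / enc.attention_mask can be passed to the seq2seq model as-is;
# the trade-off is exactly the truncation concern described above.
```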
I'm not opposed to this -- it should simplify the code a bit. I don't think the prompt is necessary, since the decoder probably un-learned everything except generating true/false. |
The tokenizer uses right-side padding by default. Since the pre-encoded prompt suffix is concatenated to the input ids after tokenization, this becomes a problem when inputs of unequal length are passed through in a single batch: the padding tokens of the shorter (first) sequence are retained in the middle, so the suffix ends up after the padding rather than directly after the document tokens. Setting the padding side to left circumvents this issue.
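To illustrate, with made-up example texts rather than the sequences from the original description, here is roughly what the concatenation looks like under right- vs. left-side padding with a HuggingFace T5 tokenizer:

```python
# Hypothetical illustration of the padding issue; the texts and the exact
# suffix string are stand-ins, not taken from the actual pyterrier_t5 code.
import torch
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
suffix_ids = tokenizer.encode("Relevant:", add_special_tokens=False)
suffix = torch.tensor([suffix_ids, suffix_ids])  # one suffix copy per sequence

texts = ["short document", "a considerably longer document about something else"]

# Right-side padding (the default): the shorter sequence is padded on the
# right, so the concatenated suffix lands *after* the pad tokens:
#   [doc tokens, </s>, <pad>, ..., <pad>, "Relevant", ":"]
enc = tokenizer(texts, return_tensors="pt", padding=True)
right_padded = torch.cat([enc.input_ids, suffix], dim=1)

# Left-side padding moves the pad tokens to the front, so the suffix follows
# the document tokens directly (only the extra </s> remains in between):
#   [<pad>, ..., <pad>, doc tokens, </s>, "Relevant", ":"]
tokenizer.padding_side = "left"
enc = tokenizer(texts, return_tensors="pt", padding=True)
left_padded = torch.cat([enc.input_ids, suffix], dim=1)
```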