
Conversation

@fschlatt commented Nov 4, 2025

The tokenizer uses right-side padding by default. Since the pre-encoded prompt suffix is concatenated to the input ids after tokenization, this becomes a problem when inputs of unequal length are passed through in a single batch: the padding tokens of the shorter sequences end up in the middle, before the appended suffix. For example, take the following two input sequences.

Query: foo Document: bar Relevant:
Query: foo Document: bar bar bar bar bar bar bar bar Relevant:

The first sequence gets tokenized into:

['▁', 'Query', ':', '▁fo', 'o', '▁Document', ':', '▁bar', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '▁Relevan', 't', ':', '</s>']

Setting the padding side to left circumvents this issue.
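
As a rough illustration, here is a minimal sketch of the padding-side effect with a stock Hugging Face `T5Tokenizer` (loaded from `t5-base` for illustration); the suffix pre-encoding below only approximates what the implementation does and is not the pyterrier_t5 code itself:

```python
# Minimal sketch of the padding-side effect (illustrative only; the suffix
# pre-encoding here only approximates what pyterrier_t5 does).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

texts = [
    "Query: foo Document: bar",
    "Query: foo Document: bar bar bar bar bar bar bar bar",
]
suffix_ids = tokenizer(" Relevant:")["input_ids"]  # pre-encoded prompt suffix

# Right padding (the default): the pad tokens end up *before* the appended suffix.
batch = tokenizer(texts, padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0] + suffix_ids))

# Left padding: the pad tokens move to the front, so the suffix directly
# follows the document.
tokenizer.padding_side = "left"
batch = tokenizer(texts, padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0] + suffix_ids))
```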

@cmacdonald (Contributor)

Thanks @fschlatt. Do you have any effectiveness numbers on msmarco-passage (before/after)?

@fschlatt (Author) commented Nov 4, 2025

The scores are actually marginally worse (k=100), using this notebook with slight modifications: https://github.com/terrier-org/pyterrier/blob/master/examples/experiments/msmarco_BM25_MonoT5.ipynb

| name | nDCG@10 |
| --- | --- |
| BM25 | 0.47954 |
| BM25 >> monoT5 | 0.700796 |
| BM25 >> NewMonoT5 | 0.695354 |

@fschlatt (Author) commented Nov 4, 2025

Since the attention mask is still correct (the pad positions are masked out), I guess the only problem is the additional EOS token that gets added before the suffix, which doesn't seem to throw the model off too much.

@fschlatt (Author) commented Nov 4, 2025

Come to think of it, left-side padding might also not be optimal due to the positional encodings of T5.

@cmacdonald (Contributor)

So this could be related to the training of the MonoT5 model? We're using the original checkpoint of Jimmy et al. here.

@fschlatt (Author) commented Nov 6, 2025

I may have been a bit too brief in explaining the issue and the proposed solution, so I'll try to lay it out in more detail:

If I'm not mistaken, the pyterrier monoT5 implementation tries to handle the case where a sequence is too long for the model, which could cause the " ... Relevant:" suffix of the prompt to be cut off. To avoid this, pyterrier T5 pre-encodes the suffix and concatenates it after tokenizing the sequence without the suffix. This can create odd problems when padding tokens are inserted, i.e., when sequences of different lengths are tokenized in a single batch. (Apparently this is not a big problem in practice, since the scores on TREC DL seem fine nonetheless.)

Left-side padding could circumvent this issue by moving the padding tokens to the very left of the input sequences; the suffix is then correctly appended to the end of each sequence. However, left-side padding may interfere with the positional encodings of the underlying T5 model, and I am not entirely sure how T5's positional encodings work.
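
To make the scheme described above concrete, here is a minimal sketch of the pre-encode-the-suffix idea, again assuming a stock Hugging Face `T5Tokenizer` loaded from `t5-base`; it is an approximation for illustration, not the actual pyterrier_t5 code:

```python
# Sketch of the "pre-encode the suffix" scheme (an approximation for
# illustration, not the actual pyterrier_t5 code).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
max_length = 512

# The prompt suffix is encoded once, up front, without special tokens.
suffix_ids = tokenizer(" Relevant:", add_special_tokens=False)["input_ids"]

# Only the query/document part is truncated, so the suffix can never be cut off.
text = "Query: foo Document: " + "bar " * 1000  # longer than the model limit
prefix_ids = tokenizer(
    text, truncation=True, max_length=max_length - len(suffix_ids)
)["input_ids"]

# The suffix is appended afterwards; with right padding in a batch, pad tokens
# would end up between prefix_ids and suffix_ids, as shown earlier.
input_ids = prefix_ids + suffix_ids
```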

Looking at the original implementation by Lin et al. (https://github.com/castorini/pygaggle/blob/master/pygaggle/model/tokenize.py#L103), it seems to me that they do not guard against overly long sequences anyway. So the pyterrier T5 implementation might as well skip the whole pre-encoding-the-suffix step and just pass the "Query: {q} Document: {d} Relevant:" prompt to the tokenizer. Then, if I'm not mistaken, the model scores would exactly match those of the original implementation.
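
For comparison, a minimal sketch of the proposed simplification (illustrative, not a drop-in patch for pyterrier_t5), again using the plain `t5-base` tokenizer:

```python
# Sketch of the proposed simplification: build the full prompt as a string and
# tokenize it in one pass, letting the tokenizer handle padding and truncation.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

pairs = [
    ("foo", "bar"),
    ("foo", "bar bar bar bar bar bar bar bar"),
]
prompts = [f"Query: {q} Document: {d} Relevant:" for q, d in pairs]

batch = tokenizer(
    prompts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
# Padding tokens now only ever appear after the complete prompt. The trade-off,
# as noted above, is that a very long document can truncate the " Relevant:"
# suffix -- which the pygaggle implementation apparently accepts as well.
```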

@seanmacavaney (Collaborator)

> So the pyterrier T5 implementation might as well skip the whole pre-encoding-the-suffix step and just pass the "Query: {q} Document: {d} Relevant:" prompt to the tokenizer.

I'm not opposed to this -- it should simplify the code a bit. I don't think the prompt is necessary, since the decoder probably un-learned everything except generating true/false.
