
Conversation

@fschlatt commented Nov 4, 2025

The tokenizer uses right-side padding by default. Since the pre-encoded prompt suffix is concatenated to the input ids after tokenization, this becomes a problem when inputs of unequal length are passed through in a single batch: the padding tokens of the shorter sequences end up in the middle, before the appended suffix. For example, take the following two input sequences.

Query: foo Document: bar Relevant:
Query: foo Document: bar bar bar bar bar bar bar bar Relevant:

The first sequence gets tokenized into:

['▁', 'Query', ':', '▁fo', 'o', '▁Document', ':', '▁bar', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '▁Relevan', 't', ':', '</s>']

Setting the padding side to left circumvents this issue.
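
As a rough illustration, here is a minimal sketch of the padding-side effect with a stock Hugging Face `T5Tokenizer` (loaded from `t5-base` for illustration); the suffix pre-encoding below only approximates what the implementation does and is not the pyterrier_t5 code itself:

```python
# Minimal sketch of the padding-side effect (illustrative only; the suffix
# pre-encoding here only approximates what pyterrier_t5 does).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

texts = [
    "Query: foo Document: bar",
    "Query: foo Document: bar bar bar bar bar bar bar bar",
]
suffix_ids = tokenizer(" Relevant:")["input_ids"]  # pre-encoded prompt suffix

# Right padding (the default): the pad tokens end up *before* the appended suffix.
batch = tokenizer(texts, padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0] + suffix_ids))

# Left padding: the pad tokens move to the front, so the suffix directly
# follows the document.
tokenizer.padding_side = "left"
batch = tokenizer(texts, padding=True)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0] + suffix_ids))
```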

@cmacdonald (Contributor)

Thanks @fschlatt. Do you have any effectiveness numbers on msmarco-passage (before/after)?

@fschlatt (Author) commented Nov 4, 2025

The scores are actually marginally worse (k=100), using this notebook with slight modifications: https://github.com/terrier-org/pyterrier/blob/master/examples/experiments/msmarco_BM25_MonoT5.ipynb

| name | nDCG@10 |
| --- | --- |
| BM25 | 0.47954 |
| BM25 >> monoT5 | 0.700796 |
| BM25 >> NewMonoT5 | 0.695354 |

@fschlatt (Author) commented Nov 4, 2025

Since the attention mask is still correct (the pad positions are masked out), I guess the only problem is the additional EOS token that gets added before the suffix, which doesn't seem to throw the model off too much.

@fschlatt (Author) commented Nov 4, 2025

Come to think of it, left-side padding might also not be optimal due to the positional encodings of T5.

@cmacdonald (Contributor)

So this could be related to the training of the MonoT5 model? We're using the original checkpoint of Jimmy et al. here.

@fschlatt (Author) commented Nov 6, 2025

I may have been a bit too brief in explaining the issue and the proposed solution, so I'll try to lay it out in more detail:

If I'm not mistaken, the pyterrier monoT5 implementation tries to handle the case where a sequence is too long for the model, which could cause the " ... Relevant:" suffix of the prompt to be cut off. To avoid this, pyterrier T5 pre-encodes the suffix and concatenates it after tokenizing the sequence without the suffix. This can create odd problems when padding tokens are inserted, i.e., when sequences of different lengths are tokenized in a single batch. (Apparently this is not a big problem in practice, since the scores on TREC DL seem fine nonetheless.)

Left-side padding could circumvent this issue by moving the padding tokens to the very left of the input sequences; the suffix is then correctly appended to the end of each sequence. However, left-side padding may interfere with the positional encodings of the underlying T5 model, and I am not entirely sure how T5's positional encodings work.
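
To make the scheme described above concrete, here is a minimal sketch of the pre-encode-the-suffix idea, again assuming a stock Hugging Face `T5Tokenizer` loaded from `t5-base`; it is an approximation for illustration, not the actual pyterrier_t5 code:

```python
# Sketch of the "pre-encode the suffix" scheme (an approximation for
# illustration, not the actual pyterrier_t5 code).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
max_length = 512

# The prompt suffix is encoded once, up front, without special tokens.
suffix_ids = tokenizer(" Relevant:", add_special_tokens=False)["input_ids"]

# Only the query/document part is truncated, so the suffix can never be cut off.
text = "Query: foo Document: " + "bar " * 1000  # longer than the model limit
prefix_ids = tokenizer(
    text, truncation=True, max_length=max_length - len(suffix_ids)
)["input_ids"]

# The suffix is appended afterwards; with right padding in a batch, pad tokens
# would end up between prefix_ids and suffix_ids, as shown earlier.
input_ids = prefix_ids + suffix_ids
```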

Looking at the original implementation by Lin et al. (https://github.com/castorini/pygaggle/blob/master/pygaggle/model/tokenize.py#L103), it seems to me that they do not guard against overly long sequences anyway. So the pyterrier T5 implementation might as well skip the whole pre-encoding-the-suffix step and just pass the "Query: {q} Document: {d} Relevant:" prompt to the tokenizer. Then, if I'm not mistaken, the model scores would exactly match those of the original implementation.
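
For comparison, a minimal sketch of the proposed simplification (illustrative, not a drop-in patch for pyterrier_t5), again using the plain `t5-base` tokenizer:

```python
# Sketch of the proposed simplification: build the full prompt as a string and
# tokenize it in one pass, letting the tokenizer handle padding and truncation.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

pairs = [
    ("foo", "bar"),
    ("foo", "bar bar bar bar bar bar bar bar"),
]
prompts = [f"Query: {q} Document: {d} Relevant:" for q, d in pairs]

batch = tokenizer(
    prompts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
# Padding tokens now only ever appear after the complete prompt. The trade-off,
# as noted above, is that a very long document can truncate the " Relevant:"
# suffix -- which the pygaggle implementation apparently accepts as well.
```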

@seanmacavaney (Collaborator)

> So the pyterrier T5 implementation might as well skip the whole pre-encoding-the-suffix step and just pass the "Query: {q} Document: {d} Relevant:" prompt to the tokenizer.

I'm not opposed to this -- it should simplify the code a bit. I don't think the prompt is necessary, since the decoder probably un-learned everything except generating true/false.
