More flaky generate tests #43713
Conversation
cc @ydshieh
My 5 cents on the issue: this one is flaky for multimodal LLMs, which I believe is because of special
Hmm, interesting! Is there a way we can fix just those models?
Not sure; we'd need a list of the still-flaky models and then examine what happens when we 'merge_image_text_features'. I think we can mark the test as flaky per model for now, if we don't want to spend time investigating.
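A minimal sketch of that per-model idea, assuming the `is_flaky` retry decorator from `transformers.testing_utils`; the class names and test name below are illustrative stand-ins rather than the actual test suite:

```python
# Hedged sketch: keep the shared generate test strict, and mark it as flaky only
# for one model by overriding it in that model's test class. `is_flaky` is the
# retry decorator from transformers.testing_utils; everything else here is a
# stand-in for illustration.
import unittest

from transformers.testing_utils import is_flaky


class SharedGenerationTests(unittest.TestCase):
    # Stand-in for the test defined once in the shared generation tester mixin.
    def test_prompt_lookup_decoding_matches_greedy_search(self):
        pass  # the real test compares prompt lookup output to greedy search


class StillFlakyMultimodalModelTest(SharedGenerationTests):
    # Only this model's version retries on failure; all other models keep the strict check.
    @is_flaky(max_attempts=3, description="image/text feature merging is non-deterministic")
    def test_prompt_lookup_decoding_matches_greedy_search(self):
        super().test_prompt_lookup_decoding_matches_greedy_search()


if __name__ == "__main__":
    unittest.main()
```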
Force-pushed from effa29d to fbecae5
Force-pushed from fbecae5 to 11ab4b6
Closing for now because I think this is covered by #43794 - we'll revisit it if the errors keep happening.
The generate tests that compare prompt lookup or speculative decoding against the base model have an extremely high rate of flakiness, I guess because of inherent non-determinism. The actual generation works, but because of this non-determinism the two outputs frequently diverge at some point and the test throws an error.
It'd be cool to make a more reliable version of these tests at some point, but for now I'm just marking them as flaky to clean up the CI!
Example failing job here: https://app.circleci.com/jobs/github/huggingface/transformers/2143338
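For context, here is a minimal sketch of the kind of comparison these tests make (not the actual test code); the tiny checkpoint and generation settings are assumptions for illustration. Prompt lookup decoding is supposed to reproduce the greedy baseline token for token, so any divergence mid-sequence makes the assertion fail; in the test suite the whole test is wrapped with a flaky marker so an occasional divergence is retried rather than failing CI.

```python
# Hedged sketch of the comparison these generate tests perform. The checkpoint
# name and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "hf-internal-testing/tiny-random-gpt2"  # assumed tiny test checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")

# Greedy baseline.
baseline = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# Prompt lookup decoding: drafts candidate tokens from the prompt itself, then
# verifies them with the same model, so the output should match greedy search.
candidate = model.generate(
    **inputs, do_sample=False, max_new_tokens=20, prompt_lookup_num_tokens=2
)

# This is the assertion that intermittently fails when the outputs diverge.
torch.testing.assert_close(baseline, candidate)
```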