
More flaky generate tests #43713

Closed
Rocketknight1 wants to merge 2 commits into main from more_flaky_tests

Conversation

@Rocketknight1 (Member) commented Feb 3, 2026

The generate tests that compare prompt lookup or speculative decoding to the base model have an extremely high rate of flakiness, I guess because of inherent non-determinism. The actual generation works, but the comparison frequently sees the two outputs diverge at some point and the test throws an error.

It'd be cool to make a more reliable version of these tests at some point, but for now I'm just marking them as flaky to clean up the CI!

Example failing job here: https://app.circleci.com/jobs/github/huggingface/transformers/2143338

@Rocketknight1 Rocketknight1 marked this pull request as ready for review February 3, 2026 15:51
@Rocketknight1 (Author)

cc @ydshieh

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member)

My 5 cents on the issue:

This one is flaky for multimodal LLMs, which I believe is because of the special multimodal tokens. For most VLMs we fixed it with the line below, which adds force_no_generate_tokens. So I believe Kosmos2/GraniteSpeech use different names for those special tokens, and we're back to flakiness

# The added line
logits_processor_kwargs = self._get_logits_processor_kwargs(config=model.config)
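The idea behind suppressing special multimodal tokens during generation is to mask their logits so they can never be sampled, which removes one source of divergence between the assisted and base runs. A toy pure-Python sketch of that masking step (the `suppress_tokens` function and the toy vocabulary are illustrative, not the transformers logits-processor implementation):

```python
import math

def suppress_tokens(logits, suppressed_ids):
    """Set the logits of suppressed token ids to -inf so greedy or
    sampled decoding can never pick them (illustrative sketch)."""
    out = list(logits)
    for tok in suppressed_ids:
        out[tok] = -math.inf
    return out

# Toy vocabulary of 5 tokens; ids 1 and 3 stand in for special
# multimodal tokens (e.g. image placeholders) we never want generated.
logits = [0.2, 3.1, 0.5, 2.7, 1.0]
filtered = suppress_tokens(logits, suppressed_ids=[1, 3])
best = max(range(len(filtered)), key=filtered.__getitem__)
print(best)  # -> 4: the greedy pick now avoids the special tokens
```

If a model names its special tokens differently, they never make it into the suppressed set, the mask is a no-op for them, and the flakiness returns, which matches the Kosmos2/GraniteSpeech hypothesis above.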

@Rocketknight1 (Author)

Hmm, interesting! Is there a way we can fix just those models?

@zucchini-nlp (Member)

Not sure, we'd need a list of still-flaky models and to examine what happens when we 'merge_image_text_features'. I think we can mark them as flaky per individual model for now, if we don't want to waste time investigating

@Rocketknight1 (Author)

Closing for now because I think this is covered by #43794 - we'll revisit it if the errors keep happening



3 participants