
Fix GRPO VLM prompt handling for string prompts #5064

Open

akshan-main wants to merge 1 commit into huggingface:main from akshan-main:fix_grpo-vlm-prompt-type

Conversation

@akshan-main

Fixes #4746
Fixes #4870
Fixes #4451

Related: #5041

What does this PR do?

When prompts are strings (produced by processor.apply_chat_template), GRPOTrainer unconditionally passes them to prepare_multimodal_messages whenever images are present. That function expects a list of {role, content} dicts, so it crashes with TypeError: string indices must be integers.

This PR adds an isinstance(prompt, list) guard so prepare_multimodal_messages is only called on conversational prompts. String prompts pass through unchanged.
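A minimal sketch of the guard (paraphrased from this description; the surrounding trainer code is simplified and the prepare_multimodal_messages call signature is assumed, not copied from the diff):

if images is not None and isinstance(prompt, list):
    # Conversational prompt: a list of {"role": ..., "content": ...} dicts,
    # the only shape prepare_multimodal_messages accepts.
    prompt = prepare_multimodal_messages(prompt, num_images=len(images))
# Plain string prompts (e.g. the output of processor.apply_chat_template)
# now pass through unchanged instead of raising TypeError.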

Note: This fixes the immediate crash. Other VLM-related issues (bf16/float mismatch, reward function errors) mentioned in #5041 are separate bugs.

This PR now closes those three issues.


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@akshan-main force-pushed the fix_grpo-vlm-prompt-type branch from 14b957e to 53a697f on February 10, 2026 at 20:30
@qgallouedec (Member)

thanks for the fix!

@akshan-main force-pushed the fix_grpo-vlm-prompt-type branch from 53a697f to 7015d50 on February 10, 2026 at 20:58
@akshan-main (Author) commented Feb 10, 2026

Fixes three VLM GRPO crashes in GRPOTrainer._generate_and_score_completions and _calculate_rewards:

  1. String prompt TypeError: When prompts are strings (from processor.apply_chat_template), prepare_multimodal_messages was called unconditionally when images were present. It expects a list of {role, content} dicts, so string prompts crash with TypeError: string indices must be integers. Added an isinstance(prompt, list) guard.

  2. pixel_values dtype mismatch: pixel_values from the processor remain float32 after _prepare_inputs, but the model may be in bf16/fp16. This causes RuntimeError: expected scalar type BFloat16 but found Float in torch.layer_norm. Fixed by casting floating-point tensors in forward_kwargs to the compute dtype when bf16=True or fp16=True (sketched after this list).

  3. Reward function exception handling: Sync and async reward functions had no exception handling; any error killed training. Wrapped calls in try/except, assigning NaN rewards on failure with a warning. Consistent with existing None→NaN handling (also sketched after this list).
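Hedged sketches of fixes 2 and 3 (paraphrased from the descriptions above; names like forward_kwargs, args, reward_func, and the reward call signature follow the text but are illustrative, and the real trainer internals may differ):

import math
import warnings

import torch

# Fix 2: cast floating-point processor outputs to the compute dtype.
if args.bf16 or args.fp16:
    compute_dtype = torch.bfloat16 if args.bf16 else torch.float16
    for key, value in forward_kwargs.items():
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            # pixel_values stay float32 after _prepare_inputs; casting avoids
            # the torch.layer_norm dtype RuntimeError.
            forward_kwargs[key] = value.to(compute_dtype)

# Fix 3: one failing reward function no longer kills training.
try:
    rewards = reward_func(completions=completions, **reward_kwargs)
except Exception as e:
    warnings.warn(f"Reward function failed: {e}")
    rewards = [math.nan] * len(completions)  # NaN, matching the existing None→NaN path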

Fixes #4746
Fixes #4870
Fixes #4451

Related: #5041. That issue asks for complete reliability, which I can't promise; if a specific failure can be reproduced, I can try to fix it. I can, however, guarantee that these three issues, along with what was raised in their comments, are fixed and tested.

Tests are included as well; please review.

@qgallouedec (Member)

After review, it seems the fix would just make training silently ignore the images. The issue actually comes from the user (and the notebook) applying the chat template themselves beforehand, instead of letting the trainer do it. I'll add a more explicit error message.
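For illustration (hypothetical dataset code, not taken from the notebook), the two patterns being contrasted look roughly like this:

# Problematic: pre-applying the chat template turns the conversational
# prompt into a plain string, so the trainer can no longer pair it with
# the images.
example["prompt"] = processor.apply_chat_template(example["messages"], tokenize=False)

# Preferred: keep the prompt as a list of {"role": ..., "content": ...}
# dicts and let GRPOTrainer apply the template itself.
example["prompt"] = example["messages"]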

@qgallouedec (Member)

> Fixes three VLM GRPO crashes

It's easier to review when we have one PR per item.

> 3. Reward function exception handling: Sync and async reward functions had no exception handling; any error killed training. Wrapped calls in try/except, assigning NaN rewards on failure with a warning. Consistent with existing None→NaN handling.

I'd be opposed to this change. It's not up to the trainer to handle reward failures. The user can still wrap their reward function like this:

import random


def maybe_float_maybe_fail():
    # Simulates a flaky reward computation: fails ~10% of the time.
    if random.random() < 0.1:
        raise Exception()
    return random.random()


def reward_func_before(completion_ids, **kwargs):
    # Unguarded: a single failure propagates and kills training.
    return [maybe_float_maybe_fail() for _ in completion_ids]


def reward_func_after(completion_ids, **kwargs):
    # Guarded: failures become None, which the trainer already maps to NaN.
    rewards = []
    for _ in completion_ids:
        try:
            reward = maybe_float_maybe_fail()
        except Exception:
            reward = None
        rewards.append(reward)
    return rewards
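Wrapping it this way keeps the failure policy in user code: a None reward flows through the trainer's existing None→NaN handling, with no trainer-side special-casing of exceptions.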

@qgallouedec (Member)

For item 1, please see #5067.
