
Fix GRPO VLM prompt handling for string prompts #5064

Open

akshan-main wants to merge 1 commit into huggingface:main from akshan-main:fix_grpo-vlm-prompt-type

Conversation

@akshan-main

Fixes #4746
Fixes #4870
Fixes #4451

Related: #5041

What does this PR do?

When prompts are strings (produced by processor.apply_chat_template), GRPOTrainer unconditionally passes them to prepare_multimodal_messages whenever images are present. That function expects a list of {role, content} dicts, so it crashes with TypeError: string indices must be integers.

This PR adds an isinstance(prompt, list) guard so prepare_multimodal_messages is only called on conversational prompts. String prompts pass through unchanged.
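A minimal sketch of the guard (paraphrased from this description; the surrounding trainer code is simplified and the prepare_multimodal_messages call signature is assumed, not copied from the diff):

if images is not None and isinstance(prompt, list):
    # Conversational prompt: a list of {"role": ..., "content": ...} dicts,
    # the only shape prepare_multimodal_messages accepts.
    prompt = prepare_multimodal_messages(prompt, num_images=len(images))
# Plain string prompts (e.g. the output of processor.apply_chat_template)
# now pass through unchanged instead of raising TypeError.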

Note: This fixes the immediate crash. Other VLM-related issues (bf16/float mismatch, reward function errors) mentioned in #5041 are separate bugs.

This PR now closes those three issues.


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@akshan-main force-pushed the fix_grpo-vlm-prompt-type branch from 14b957e to 53a697f on February 10, 2026 at 20:30
@qgallouedec (Member)

thanks for the fix!

@akshan-main force-pushed the fix_grpo-vlm-prompt-type branch from 53a697f to 7015d50 on February 10, 2026 at 20:58
@akshan-main (Author) commented Feb 10, 2026

Fixes three VLM GRPO crashes in GRPOTrainer._generate_and_score_completions and _calculate_rewards:

  1. String prompt TypeError: When prompts are strings (from processor.apply_chat_template), prepare_multimodal_messages was called unconditionally when images were present. It expects a list of {role, content} dicts, so string prompts crash with TypeError: string indices must be integers. Added an isinstance(prompt, list) guard.

  2. pixel_values dtype mismatch: pixel_values from the processor remain float32 after _prepare_inputs, but the model may be in bf16/fp16. This causes RuntimeError: expected scalar type BFloat16 but found Float in torch.layer_norm. Fixed by casting floating-point tensors in forward_kwargs to the compute dtype when bf16=True or fp16=True (sketched after this list).

  3. Reward function exception handling: Sync and async reward functions had no exception handling; any error killed training. Wrapped calls in try/except, assigning NaN rewards on failure with a warning. Consistent with existing None→NaN handling (also sketched after this list).
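Hedged sketches of fixes 2 and 3 (paraphrased from the descriptions above; names like forward_kwargs, args, reward_func, and the reward call signature follow the text but are illustrative, and the real trainer internals may differ):

import math
import warnings

import torch

# Fix 2: cast floating-point processor outputs to the compute dtype.
if args.bf16 or args.fp16:
    compute_dtype = torch.bfloat16 if args.bf16 else torch.float16
    for key, value in forward_kwargs.items():
        if isinstance(value, torch.Tensor) and value.is_floating_point():
            # pixel_values stay float32 after _prepare_inputs; casting avoids
            # the torch.layer_norm dtype RuntimeError.
            forward_kwargs[key] = value.to(compute_dtype)

# Fix 3: one failing reward function no longer kills training.
try:
    rewards = reward_func(completions=completions, **reward_kwargs)
except Exception as e:
    warnings.warn(f"Reward function failed: {e}")
    rewards = [math.nan] * len(completions)  # NaN, matching the existing None→NaN path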

Fixes #4746
Fixes #4870
Fixes #4451

Related: #5041. That issue asks for complete reliability, which I can't promise; if a specific failure can be reproduced, I can try to fix it. I can, however, guarantee that these three issues, along with what was raised in their comments, are fixed and tested.

Tests are included as well; please review.

@qgallouedec (Member)

After review, it seems the fix would just make training silently ignore the images. The issue actually comes from the user (and the notebook) applying the chat template themselves beforehand, instead of letting the trainer do it. I'll add a more explicit error message.
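For illustration (hypothetical dataset code, not taken from the notebook), the two patterns being contrasted look roughly like this:

# Problematic: pre-applying the chat template turns the conversational
# prompt into a plain string, so the trainer can no longer pair it with
# the images.
example["prompt"] = processor.apply_chat_template(example["messages"], tokenize=False)

# Preferred: keep the prompt as a list of {"role": ..., "content": ...}
# dicts and let GRPOTrainer apply the template itself.
example["prompt"] = example["messages"]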

@qgallouedec (Member)

> Fixes three VLM GRPO crashes

It's easier to review when we have one PR per item.

> 3. Reward function exception handling: Sync and async reward functions had no exception handling; any error killed training. Wrapped calls in try/except, assigning NaN rewards on failure with a warning. Consistent with existing None→NaN handling.

I'd be opposed to this change. It's not up to the trainer to handle reward failures. The user can still wrap their reward function like this:

import random


def maybe_float_maybe_fail():
    # Simulates a flaky reward computation: fails ~10% of the time.
    if random.random() < 0.1:
        raise Exception()
    return random.random()


def reward_func_before(completion_ids, **kwargs):
    # Unguarded: a single failure propagates and kills training.
    return [maybe_float_maybe_fail() for _ in completion_ids]


def reward_func_after(completion_ids, **kwargs):
    # Guarded: failures become None, which the trainer already maps to NaN.
    rewards = []
    for _ in completion_ids:
        try:
            reward = maybe_float_maybe_fail()
        except Exception:
            reward = None
        rewards.append(reward)
    return rewards
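Wrapping it this way keeps the failure policy in user code: a None reward flows through the trainer's existing None→NaN handling, with no trainer-side special-casing of exceptions.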

@qgallouedec (Member)

For item 1, please see #5067.
