Question about the output of the reward model in RLHF? #4475

@gauss-clb

Description

Why does the reward model use mean(values[:, :-1], dim=1) as its output?

values = self.value_head(last_hidden_states)[:, :-1]
value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)

values = self.value_head(last_hidden_states)[:, :-1]

The input may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>, so I think values should be self.value_head(last_hidden_states)[:, :index_of_eos + 1].
Index -1 may land on a pad token, so that output is meaningless. (A batched version of this idea is sketched below.)
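For batch_size > 1 with variable lengths, the slice above could be replaced by a masked mean. A minimal sketch, assuming the usual attention_mask convention (1 on real tokens including <eos>, 0 on <pad>); attention_mask itself is an assumption, not part of the original snippet:

values = self.value_head(last_hidden_states).squeeze(-1)   # (B, T) per-token scalars
mask = attention_mask.float()                              # (B, T), 1 = real token, 0 = <pad>
# average only over non-pad positions, per sequence -> shape (B,)
value = (values * mask).sum(dim=1) / mask.sum(dim=1)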

Also, wouldn't it make more sense to use the output at the last real token instead of the mean over the input sequence?
i.e. value = self.value_head(last_hidden_states)[:, index_of_eos] (for batch_size > 1, use torch.gather instead; see the sketch after the snippet below)
rather than

# for batch_size=1
values = self.value_head(last_hidden_states)[:, :index_of_eos + 1]
value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)
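In gather form, the last-token variant might look like this (a sketch; index_of_eos is assumed to be a (B,)-shaped LongTensor holding each sequence's <eos> position, which the original code does not define):

values = self.value_head(last_hidden_states).squeeze(-1)       # (B, T) per-token scalars
# pick each sequence's value at its own <eos> position -> shape (B,)
value = values.gather(1, index_of_eos.unsqueeze(1)).squeeze(1)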

There is another strange problem: in the RL stage, the input to the reward model may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>

This input format differs from the one used when training the reward model. Could that be a cause of unstable training?
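If this mismatch does hurt, one workaround (a sketch of the idea only; repack_right_padded is a hypothetical helper, not something the repo ships) is to re-pack each rollout so all pads sit on the right, matching the layout the reward model saw during training:

import torch

def repack_right_padded(input_ids, pad_token_id):
    # drop the interior <pad> run between question and answer,
    # then re-pad every sequence on the right
    out = torch.full_like(input_ids, pad_token_id)
    for i, row in enumerate(input_ids):
        real = row[row != pad_token_id]   # keep only non-pad tokens
        out[i, :real.numel()] = real
    return out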
