Why does the reward model use mean(values[:, :-1], dim=1) as its output?
values = self.value_head(last_hidden_states)[:, :-1]
value = values.mean(dim=1).squeeze(1) # ensure shape is (B)
(Code from ColossalAI/applications/Chat/coati/models/base/reward_model.py, line 39 at commit d20dceb.)
The input may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>, so I think values should use self.value_head(last_hidden_states)[:, :index_of_eos + 1] instead.
Index -1 may correspond to a pad token, so its output is meaningless.
Also, would it be better to use the output at the last (EOS) token instead of the mean over the input sequence?
i.e. value = self.value_head(last_hidden_states)[:, index_of_eos] (for batch_size > 1, use torch.gather instead; see the sketch after the snippet below)
rather than
# for batch_size=1
values = self.value_head(last_hidden_states)[:, :index_of_eos + 1]
value = values.mean(dim=1).squeeze(1) # ensure shape is (B)
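For batch_size > 1, something along these lines could work. This is only a rough sketch under my assumptions (exactly one EOS token per sample, pad_token_id != eos_token_id, last_hidden_states of shape (B, T, H)); the helper name is mine, not from the repo:

import torch

def eos_token_value(value_head, last_hidden_states, input_ids, eos_token_id):
    # value_head maps the hidden size to 1, so values has shape (B, T) after squeezing.
    values = value_head(last_hidden_states).squeeze(-1)
    # Position of the (single) EOS token in each sample.
    index_of_eos = (input_ids == eos_token_id).long().argmax(dim=1)  # (B,)
    # Pick the value at the EOS position for every sample in the batch.
    value = torch.gather(values, dim=1, index=index_of_eos.unsqueeze(1)).squeeze(1)  # (B,)
    return value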
There is another strange problem: in the RL stage, the input to the reward model may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>.
This input format differs from the one used during reward model training. Could this mismatch be the cause of unstable training?
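If the concern is that the mean includes pad positions the reward model never saw during training, one option could be a pad-aware mean over only the real tokens, using the attention mask. This is just a sketch with illustrative names (attention_mask is 1 on real tokens, 0 on pads), not the repo's current code:

def masked_mean_value(value_head, last_hidden_states, attention_mask):
    values = value_head(last_hidden_states).squeeze(-1)  # (B, T)
    mask = attention_mask.float()
    # Average only over non-pad positions, wherever the pads happen to sit.
    value = (values * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (B,)
    return value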