Why does the reward model use mean(values[:, :-1], dim=1) as its output?
values = self.value_head(last_hidden_states)[:, :-1]
value = values.mean(dim=1).squeeze(1) # ensure shape is (B)
(Code from ColossalAI/applications/Chat/coati/models/base/reward_model.py, line 39 at commit d20dceb.)
The input may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>, so I think values should use self.value_head(last_hidden_states)[:, :index_of_eos + 1] instead.
Index -1 may correspond to a pad token, so its output is meaningless.
Also, would it be better to use the output at the last (EOS) token instead of the mean over the input sequence?
i.e. value = self.value_head(last_hidden_states)[:, index_of_eos] (for batch_size > 1, use torch.gather instead; see the sketch after the snippet below)
rather than
# for batch_size=1
values = self.value_head(last_hidden_states)[:, :index_of_eos + 1]
value = values.mean(dim=1).squeeze(1) # ensure shape is (B)
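For batch_size > 1, something along these lines could work. This is only a rough sketch under my assumptions (exactly one EOS token per sample, pad_token_id != eos_token_id, last_hidden_states of shape (B, T, H)); the helper name is mine, not from the repo:

import torch

def eos_token_value(value_head, last_hidden_states, input_ids, eos_token_id):
    # value_head maps the hidden size to 1, so values has shape (B, T) after squeezing.
    values = value_head(last_hidden_states).squeeze(-1)
    # Position of the (single) EOS token in each sample.
    index_of_eos = (input_ids == eos_token_id).long().argmax(dim=1)  # (B,)
    # Pick the value at the EOS position for every sample in the batch.
    value = torch.gather(values, dim=1, index=index_of_eos.unsqueeze(1)).squeeze(1)  # (B,)
    return value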
There is another strange problem: in the RL stage, the input to the reward model may look like <bos_token_id> <question_token_id_1> ... <question_token_id_n> <pad_token_id> ... <pad_token_id> <answer_token_id_1> ... <answer_token_id_m> <eos_token_id> <pad_token_id> ... <pad_token_id>.
This input format differs from the one used during reward model training. Could this mismatch be the cause of unstable training?
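If the concern is that the mean includes pad positions the reward model never saw during training, one option could be a pad-aware mean over only the real tokens, using the attention mask. This is just a sketch with illustrative names (attention_mask is 1 on real tokens, 0 on pads), not the repo's current code:

def masked_mean_value(value_head, last_hidden_states, attention_mask):
    values = value_head(last_hidden_states).squeeze(-1)  # (B, T)
    mask = attention_mask.float()
    # Average only over non-pad positions, wherever the pads happen to sit.
    value = (values * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (B,)
    return value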