When custom stop strings are used in HFPolicy, the generation-length computation in fsdp1_policy_worker.py#L716-L721 always treats the first padded EOS token as a generated EOS token. For example, for a batch of two samples:
[token1, stop token, padded EOS, padded EOS]
[token2, token3, token4, token5]
The current implementation computes generation lengths of 3 and 4, respectively, whereas the correct lengths are 2 and 4: the first sample ends at its stop token, so the padded EOS that follows should not be counted.
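A minimal sketch of the miscount, in plain Python rather than the actual worker code. The token ids and the helper `length_through_first_eos` are hypothetical; the helper only assumes, as described above, that the length is taken up to and including the first EOS token found in the row.

```python
EOS_ID = 0   # hypothetical id, used both as EOS and as the padding token
STOP_ID = 9  # hypothetical id of the token that completes a custom stop string


def length_through_first_eos(tokens, eos_id):
    """Count tokens up to and including the first EOS; full length if none."""
    for i, tok in enumerate(tokens):
        if tok == eos_id:
            return i + 1
    return len(tokens)


batch = [
    [11, STOP_ID, EOS_ID, EOS_ID],  # stopped on the stop string, then padded
    [12, 13, 14, 15],               # ran to the maximum length, no padding
]

computed = [length_through_first_eos(row, EOS_ID) for row in batch]
correct = [2, 4]  # the first sample really ends at the stop token

print("computed:", computed)
print("correct: ", correct)
```

Because padding and EOS share one id here, the helper cannot tell a generated EOS from a padded one, which is why the first sample is over-counted by one.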
This issue mainly affects `run_multi_turn_rollout` with HFPolicy and a batch size greater than 1. Some potential fixes I can think of:
- Using different tokens for padding and EOS in HFPolicy.
- Computing the generation lengths based on non-zero logprobs (rejected by @SahilJain314).
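As a rough illustration of the first option, here is a sketch that assumes padding gets its own token id. The ids and the helper `length_before_padding` are hypothetical and not part of HFPolicy; the point is only that with a distinct pad id, a generated EOS can be counted while padding is not.

```python
EOS_ID = 0   # hypothetical EOS id
PAD_ID = 1   # hypothetical pad id, distinct from EOS
STOP_ID = 9  # hypothetical id of the token that completes a custom stop string


def length_before_padding(tokens, pad_id):
    """Count tokens before the first pad token; full length if no padding."""
    for i, tok in enumerate(tokens):
        if tok == pad_id:
            return i
    return len(tokens)


batch = [
    [11, STOP_ID, PAD_ID, PAD_ID],  # stopped on the stop string, then padded
    [12, 13, 14, 15],               # ran to the maximum length
    [16, 17, EOS_ID, PAD_ID],       # generated a real EOS, then padded
]

lengths = [length_before_padding(row, PAD_ID) for row in batch]
print(lengths)
```

This sketch assumes padding is only ever appended on the right; left-padded prompts or models whose tokenizer defines no separate pad token would need different handling.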