Thanks for your interesting work! I think this is a meaningful idea for general, non-verifiable scenarios.
Could you please share how your reward or perplexity (PPL) curves change during training? I believe this would help me better understand the performance of your method.