About the reward or PPL curve

Thanks for your interesting work! I think this is a meaningful idea for general, non-verifiable scenarios.

Could you please share how your reward or perplexity (PPL) curve changes during training? It would help me better understand the performance of your method.