From ccf29fc0e78b458a32e384fc8f5066ef38d73333 Mon Sep 17 00:00:00 2001
From: Sahil Jain
Date: Fri, 21 Mar 2025 16:17:13 -0700
Subject: [PATCH] Updated adding models docs to fix latex rendering errors and fix math

Signed-off-by: Sahil Jain
---
 docs/adding_new_models.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/adding_new_models.md b/docs/adding_new_models.md
index e80fbb1a79..c39642ea69 100644
--- a/docs/adding_new_models.md
+++ b/docs/adding_new_models.md
@@ -8,7 +8,7 @@ In on-policy RL, we sample tokens (actions) from the latest version of the polic
 
 As an example, we would see errors in naive KL estimation:
 
-$$\text{KL} = \mathbb{E}_{x \sim \pi}[\pi(x) - \pi_{\text{ref}}(x)]$$
+$$\text{KL} = E_{x \sim \pi}[\pi(x) - \pi_{\text{ref}}(x)]$$
 
 When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong}}$ leads to an error of:
 
@@ -17,10 +17,12 @@ $$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}
 So, to verify correctness, we calculate
 
 $$
-\frac{1}{n}\sum_{i=1}^{n}\exp\left(\left\|\text{lp_train_fwk}_i - \text{lp_infer_fwk}_i\right\|\right)
+\frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-sampling-fwk}_i\right\|\right)
 $$
 
-as a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the sampling framework could lack distribution support and we wouldn't catch it here)
+where samples are drawn as $x \sim \pi_{\text{sampling-framework}}$
+
+as a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the sampling framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{sampling-framework}}$). To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
 
 ## Understanding Discrepancies Between Backends
 
@@ -118,4 +120,4 @@ When validating your model, you should analyze the results across different conf
 
 ---
 
-By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-Reinforcer**'s requirements.
+By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-Reinforcer**'s requirements.
\ No newline at end of file
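
For reference, the metric documented in the second hunk can be sketched in a few lines of Python. This is a minimal illustration, not code from NeMo-Reinforcer: the function names `multiplicative_prob_error` and `symmetric_prob_error` are hypothetical, and it assumes the per-token log-probabilities of the same sampled tokens are already available from both frameworks as plain sequences of floats.

```python
import math
from typing import Sequence, Tuple


def multiplicative_prob_error(
    logprobs_train: Sequence[float],
    logprobs_sampling: Sequence[float],
) -> float:
    """Mean over tokens of exp(|logprob_train_i - logprob_sampling_i|).

    Both sequences must score the SAME sampled tokens, in the same order.
    """
    if len(logprobs_train) != len(logprobs_sampling):
        raise ValueError("log-prob sequences must have the same length")
    if len(logprobs_train) == 0:
        raise ValueError("need at least one token")
    total = sum(
        math.exp(abs(lp_t - lp_s))
        for lp_t, lp_s in zip(logprobs_train, logprobs_sampling)
    )
    return total / len(logprobs_train)


def symmetric_prob_error(
    run_sampled_from_sampling_fwk: Tuple[Sequence[float], Sequence[float]],
    run_sampled_from_training_fwk: Tuple[Sequence[float], Sequence[float]],
) -> float:
    """Average of the metric over two runs (hypothetical helper).

    Each argument is a (logprobs_train, logprobs_sampling) pair; the first
    run scores tokens drawn from the sampling framework, the second scores
    tokens drawn from the training framework.
    """
    first = multiplicative_prob_error(*run_sampled_from_sampling_fwk)
    second = multiplicative_prob_error(*run_sampled_from_training_fwk)
    return 0.5 * (first + second)
```

A result of exactly 1.0 means the two frameworks assign identical probabilities to every sampled token; a result near 1.05 corresponds to roughly a 5% average multiplicative gap. The two-run helper mirrors the stricter check described in the patch, while the single-run function matches what the patch says is used in practice.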