From ccf29fc0e78b458a32e384fc8f5066ef38d73333 Mon Sep 17 00:00:00 2001
From: Sahil Jain <sahilj@nvidia.com>
Date: Fri, 21 Mar 2025 16:17:13 -0700
Subject: [PATCH] Updated adding models docs to fix LaTeX rendering errors and
 fix math

Signed-off-by: Sahil Jain <sahilj@nvidia.com>
---
 docs/adding_new_models.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
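
Reviewer note (not part of the commit): the metric rewritten in the second
hunk can be sanity-checked with a minimal Python sketch like the one below.
The function name `multiplicative_prob_error` and its two arguments are
hypothetical, chosen here for illustration; they are not taken from the
NeMo-Reinforcer code base. It assumes per-token log probabilities for the
same sampled tokens are already available from both frameworks.

```python
import math
from typing import Sequence


def multiplicative_prob_error(
    train_logprobs: Sequence[float],
    sampling_logprobs: Sequence[float],
) -> float:
    """Mean over tokens of exp(|logprob_train_i - logprob_sampling_i|).

    A value of 1.0 means both frameworks assign identical probabilities
    to every sampled token; larger values mean larger multiplicative error.
    """
    if len(train_logprobs) != len(sampling_logprobs):
        raise ValueError("both inputs must cover the same tokens")
    if not train_logprobs:
        raise ValueError("at least one token is required")
    total = sum(
        math.exp(abs(t - s)) for t, s in zip(train_logprobs, sampling_logprobs)
    )
    return total / len(train_logprobs)
```

For example, if the two frameworks agree on one token and differ by a factor
of two in probability on another, the per-token terms are 1 and 2, so the
metric is 1.5.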

diff --git a/docs/adding_new_models.md b/docs/adding_new_models.md
index e80fbb1a79..c39642ea69 100644
--- a/docs/adding_new_models.md
+++ b/docs/adding_new_models.md
@@ -8,7 +8,7 @@ In on-policy RL, we sample tokens (actions) from the latest version of the polic
 
 As an example, we would see errors in naive KL estimation:
 
-$$\text{KL} = \mathbb{E}_{x \sim \pi}[\pi(x) - \pi_{\text{ref}}(x)]$$  
+$$\text{KL} = E_{x \sim \pi}[\pi(x) - \pi_{\text{ref}}(x)]$$  
 
 When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong}}$ leads to an error of:
 
@@ -17,10 +17,12 @@ $$\sum_{x} \left( \pi(x) - \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}
 So, to verify correctness, we calculate
 
 $$
-\frac{1}{n}\sum_{i=1}^{n}\exp\left(\left\|\text{lp_train_fwk}_i - \text{lp_infer_fwk}_i\right\|\right)
+\frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left\|\text{logprobs-train-fwk}_i - \text{logprobs-sampling-fwk}_i\right\|\right)
 $$
 
-as a measure of multiplicative probability error for sampled tokens. Note that this is not exhaustive (the sampling framework could lack distribution support and we wouldn't catch it here)
+where samples are drawn as $x \sim \pi_{\text{sampling-framework}}$, as a measure of multiplicative probability error for sampled tokens.
+
+Note that this check is not exhaustive: because $x \sim \pi_{\text{sampling-framework}}$, tokens that the sampling framework never produces are never scored, so a lack of distribution support in the sampling framework would go undetected. For a stricter check, compute this metric twice and average the results, sampling $x \sim \pi_{\text{training-framework}}$ in the second run. In practice, we use only the first in our tests and find it sufficient.
 
 ## Understanding Discrepancies Between Backends
 
@@ -118,4 +120,4 @@ When validating your model, you should analyze the results across different conf
 
 ---
 
-By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-Reinforcer**'s requirements.
+By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets **NeMo-Reinforcer**'s requirements.
\ No newline at end of file