`compute_approx_kl` for `NaiveExperienceMaker` may be incorrect.
As mentioned in Approximating KL Divergence:
$$ KL[q,p] = \mathbb{E}_{x\sim q}[\log\frac{q(x)}{p(x)}] $$
Let
$$ r = \frac{p(x)}{q(x)} $$
Note that $x$ is sampled from distribution $q$.
Then
$$ KL_{approx}[q,p] = \mathbb{E}_{x\sim q}[-\log(r) + (r-1) ] $$
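For reference, here is a minimal NumPy sketch (not from the repo) that checks this estimator against the exact KL on a toy discrete distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete distributions: q is the sampling distribution, p the reference.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.5, 0.3])

# Exact KL[q, p] = sum_x q(x) * log(q(x) / p(x))
exact_kl = np.sum(q * np.log(q / p))

# Monte Carlo estimate with the estimator above: E_{x~q}[-log(r) + (r - 1)],
# where r = p(x) / q(x) and x is drawn from q.
x = rng.choice(len(q), size=100_000, p=q)
r = p[x] / q[x]
approx_kl = np.mean(-np.log(r) + (r - 1))

print(f"exact KL[q,p] = {exact_kl:.4f}")
print(f"approx KL     = {approx_kl:.4f}")  # should be close to the exact value
```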
In the paper Training language models to follow instructions with human feedback (InstructGPT), the objective for the actor (i.e. the reward of an experience, ignoring `loss_ptx`) is:
$$ \text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi^{RL}_\phi}}\left[ r_\theta(x,y) - \beta \log\frac{\pi^{RL}_\phi(y \mid x)}{\pi^{SFT}(y \mid x)} \right] $$
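As a hypothetical sketch of how this penalized reward might be formed during experience collection (`beta`, `rm_score`, and `approx_kl` are illustrative names, not the repo's API):

```python
import torch

beta = 0.1  # KL penalty coefficient (illustrative value)

def penalized_reward(rm_score: torch.Tensor, approx_kl: torch.Tensor) -> torch.Tensor:
    # rm_score: reward-model score r_theta(x, y) for each sampled sequence
    # approx_kl: estimated KL[pi_RL, pi_SFT] for each sequence
    # The actor's objective above, with the loss_ptx term ignored.
    return rm_score - beta * approx_kl
```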
So when computing the KL, samples are drawn from the actor model, i.e. $\pi^{RL}_\phi$, instead of from $\pi^{SFT}$.
The KL in the objective is therefore $KL[\pi^{RL}_\phi, \pi^{SFT}] = KL[q, p]$, so the $r$ in $KL_{approx}$ should be $\frac{\pi^{SFT}(x)}{\pi^{RL}_\phi(x)}$.
However, in `coati.models.utils.compute_approx_kl` the ratio is computed as
```python
log_ratio = log_probs - log_probs_base
```
where `log_probs` and `log_probs_base` correspond to the actor model and the SFT model respectively. This should be modified to
```python
log_ratio = log_probs_base - log_probs
```
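For clarity, here is a minimal sketch of what the corrected estimator could look like; the real function in `coati.models.utils` may take additional arguments (e.g. a mask over action tokens), so this only illustrates the direction of the ratio:

```python
import torch


def compute_approx_kl_fixed(log_probs: torch.Tensor,
                            log_probs_base: torch.Tensor) -> torch.Tensor:
    """Per-token approximation of KL[pi_RL, pi_SFT].

    log_probs:      log pi_RL(x) from the actor model (sampling distribution q).
    log_probs_base: log pi_SFT(x) from the SFT model (reference distribution p).
    """
    # r = pi_SFT(x) / pi_RL(x), hence log r = log_probs_base - log_probs
    log_ratio = log_probs_base - log_probs
    # KL_approx[q, p] = E_{x~q}[-log(r) + (r - 1)]
    return (log_ratio.exp() - 1.0) - log_ratio
```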