
Direct Preference Optimization (DPO) style rewards #99

Merged
opentaco merged 12 commits into staging from feature/dpo-rewards on Aug 24, 2023

Conversation

@opentaco
Contributor

Direct Preference Optimization (DPO) style rewards

Calculates a direct preference optimization (DPO) style reward for a completion: the reference model's average log-probability of the completion tokens, conditioned on the prompt.

Uses guidance from https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py.
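
For context, here is a minimal sketch of how such a reward can be computed, assuming a Hugging Face transformers causal LM as the reference model; the model name, function name, and boundary handling below are illustrative assumptions, not the PR's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any Hugging Face causal LM can stand in as the reference model.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def dpo_style_reward(prompt: str, completion: str) -> float:
    """Average log-probability of the completion tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)

    # Shift so the logits at position t score the token at position t+1,
    # mirroring the per-token log-prob computation in the reference trainer.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = torch.gather(logprobs, 2, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the completion tokens (the -1 accounts for the shift above).
    # Caveat: tokenizing prompt and prompt+completion separately can disagree
    # at the boundary for some tokenizers; real code should align token ids.
    completion_logprobs = token_logprobs[:, prompt_ids.shape[1] - 1 :]
    return completion_logprobs.mean().item()
```

Averaging over the completion tokens, rather than summing, keeps the reward roughly comparable across completions of different lengths.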


@steffencruz (Contributor) left a comment:

Interesting changes. Looking forward to seeing the experiments.

opentaco marked this pull request as ready for review on August 24, 2023, 08:36
opentaco requested a review from Eugene-hu on August 24, 2023, 08:37
opentaco merged commit 6286e9c into staging on Aug 24, 2023
opentaco deleted the feature/dpo-rewards branch on August 24, 2023, 15:32
p-ferreira mentioned this pull request on Aug 28, 2023