
[Experimental] Add SDFT trainer, config, docs, and tests#4941

Open
Shekswess wants to merge 9 commits into huggingface:main from Shekswess:feature/sdft-trainer

Conversation

@Shekswess

What does this PR do?

Adds an experimental Self‑Distillation Fine‑Tuning (SDFT) trainer to TRL, including:

  • SDFTTrainer + SDFTConfig under trl.experimental.sdft
  • strict dataset validation for prompt/teacher_prompt
  • teacher handling (explicit or defaulted from student checkpoint)
  • docs page + toctree entry
  • unit tests (init checks + low‑priority training smoke test)

Fixes #4940
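
To make the intended API concrete, here is a rough usage sketch. The `trl.experimental.sdft` module path and the `prompt`/`teacher_prompt` columns come from this PR; the constructor arguments, config fields, and checkpoint name below are assumptions and may not match the final implementation.

```python
# Hypothetical usage sketch of the experimental SDFT trainer added in this PR.
# The module path and dataset columns come from the PR description; the exact
# constructor arguments are assumptions.
from datasets import Dataset

from trl.experimental.sdft import SDFTConfig, SDFTTrainer

# Each example needs a student prompt and a teacher prompt (strictly validated).
train_dataset = Dataset.from_dict(
    {
        "prompt": ["Explain what self-distillation is."],
        "teacher_prompt": [
            "You already know the answer. Explain what self-distillation is."
        ],
    }
)

config = SDFTConfig(output_dir="sdft-demo")  # assumed to mirror other TRL configs

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # student; teacher defaults to the same checkpoint
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```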

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@qgallouedec

@Shekswess
Author

Shekswess commented Jan 31, 2026

@qgallouedec maybe this is not the perfect implementation, but overall I think it's okay. I followed the original code from the authors of the papers, and that was kinda messy hahahaha

Please feel free to drop any comments on how we can get this into the best shape possible (improvements, coverage, etc.) and I can help you on this one. This is my first PR like this, so I'm really excited. ❤️

P.S. I want this trainer to be added as an experimental trainer because I want to do active research on self-distillation methods for tiny language models; I see it as a possibility to make them even more powerful.

@jonhue

jonhue commented Feb 1, 2026

Hello @Shekswess, one of the authors here 👋
I had a quick glance over the code & it looks good. Thanks so much for making this PR!!

Currently, this implementation is for offline training (i.e. training on a fixed dataset of teacher prompts). I was wondering whether we could easily extend this implementation to online training too?
This is what we did in our other paper: https://github.com/lasgroup/SDPO. Unfortunately, our implementation for this is in verl, but it is 1-to-1 the same algorithm. The only difference is where the teacher prompts come from. In SDPO, the teacher prompts are created "online" using generated trajectories that are marked as correct by the environment and any other rich signal returned by the environment.
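
For readers who haven't seen SDPO: a hedged sketch of the online idea described above, where teacher prompts are built from the student's own rollouts that the environment verifies. All names and the prompt template below are placeholders, not part of TRL or SDPO.

```python
# Hedged sketch of the online variant: teacher prompts are constructed from the
# student's own rollouts that the environment marks as correct, then reused as
# an offline-style (prompt, teacher_prompt) dataset. Illustrative only; the
# callables are placeholders for the real rollout and verification logic.
from typing import Callable

def collect_online_teacher_prompts(
    prompts: list[str],
    generate: Callable[[str], str],      # student rollout, e.g. a vLLM call
    verify: Callable[[str, str], bool],  # environment check (tests, reward, ...)
) -> list[dict[str, str]]:
    pairs = []
    for prompt in prompts:
        completion = generate(prompt)
        if verify(prompt, completion):
            # Condition the teacher prompt on the verified trajectory.
            teacher_prompt = (
                f"{prompt}\n\nA known-correct solution:\n{completion}\n\nNow answer again."
            )
            pairs.append({"prompt": prompt, "teacher_prompt": teacher_prompt})
    return pairs
```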

Do you think it would make sense to integrate these into one implementation of self-distillation?

@Shekswess
Author

Heyoooo @jonhue!
First of all, really awesome job with this idea, the paper, and everything you guys have done in general with this approach. I love it and I cannot wait to test this on tiny language models. About the implementation: I think we can actually do an "online" version of this. Before starting to modify the code, I want to consult @qgallouedec, because this legend knows a lot about trl (as main contributor) and what the best way of implementing this in trl would be. I can do all the heavy work, so no worries on that front. The only help I would need is some advice on how to handle this as well as possible in trl. This would be huge for a lot of research folks, because the whole approach looks really promising to me and I cannot wait to get my hands dirty hahahaha.

@jonhue

jonhue commented Feb 2, 2026

Amazing! Happy to help!

@qgallouedec
Member

Hey, sorry for the late review, this one is quite big! Thank you!

At this point, I have a few remarks:

  • I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation; I think we should enforce that.
  • In my understanding, we only generate one completion from the teacher and one from the model, so it's fair to just hard-code num_generations=1 (see the sketch after this list).
  • In the paper they say, "If needed, add importance sampling to compensate for differences between the inference engine (e.g., VLLM) and the training code.", but they don't provide guidance on how they do it. I think for the first implementation we should just completely drop this IS correction and maybe add it later.
  • There have been many changes in GRPO during the last days/weeks, including tool-calling support, which seems to be very important in the paper. I'm trying to integrate all of them in this PR: [WIP] Integrate latest changes to SDFT (Shekswess/trl#1). It is a work in progress, not working, no need to review it yet.
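
For orientation, here is a toy sketch of the self-distillation step these remarks converge on: the same model acts as teacher and student, a single completion is generated per prompt (num_generations=1), and there is no importance-sampling correction. This is an illustration of the general idea under those assumptions, not the PR's actual training loop or loss.

```python
# Toy sketch only: self-distillation with one model and one completion per prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain self-distillation in one sentence."
teacher_prompt = "You are an expert teacher. " + prompt

# 1) Generate one completion from the teacher prompt with the current weights.
with torch.no_grad():
    teacher_ids = tokenizer(teacher_prompt, return_tensors="pt").input_ids
    completion_ids = model.generate(teacher_ids, max_new_tokens=64, do_sample=True)[
        :, teacher_ids.shape[1]:
    ]

# 2) Train the same model to produce that completion from the plain prompt.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # only the completion contributes to the loss

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```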

@perceptiveshawty

  • I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

This makes sense if it's meant to serve as a reference implementation / reproduce results.

That said, it would be easy to just default to self-distillation when ref_model isn't explicitly provided. Then having the optionality would support research into effective teachers, differences between model families, etc.
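
A tiny sketch of that defaulting behaviour, just to illustrate the suggestion; the real SDFTTrainer signature may look different.

```python
# Illustrative only: fall back to self-distillation when no teacher is given.
def resolve_teacher(model, teacher_model=None):
    """Use the explicit teacher if provided, otherwise the student itself."""
    return teacher_model if teacher_model is not None else model
```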

