
[Experimental] Add SDFT trainer, config, docs, and tests#4941

Open
Shekswess wants to merge 9 commits into huggingface:main from Shekswess:feature/sdft-trainer

Conversation

@Shekswess

What does this PR do?

Adds an experimental Self‑Distillation Fine‑Tuning (SDFT) trainer to TRL, including:

  • SDFTTrainer + SDFTConfig under trl.experimental.sdft
  • strict dataset validation for prompt/teacher_prompt
  • teacher handling (explicit or defaulted from student checkpoint)
  • docs page + toctree entry
  • unit tests (init checks + low‑priority training smoke test)

Fixes #4940
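
To make the intended API concrete, here is a rough usage sketch. The `trl.experimental.sdft` module path and the `prompt`/`teacher_prompt` columns come from this PR; the constructor arguments, config fields, and checkpoint name below are assumptions and may not match the final implementation.

```python
# Hypothetical usage sketch of the experimental SDFT trainer added in this PR.
# The module path and dataset columns come from the PR description; the exact
# constructor arguments are assumptions.
from datasets import Dataset

from trl.experimental.sdft import SDFTConfig, SDFTTrainer

# Each example needs a student prompt and a teacher prompt (strictly validated).
train_dataset = Dataset.from_dict(
    {
        "prompt": ["Explain what self-distillation is."],
        "teacher_prompt": [
            "You already know the answer. Explain what self-distillation is."
        ],
    }
)

config = SDFTConfig(output_dir="sdft-demo")  # assumed to mirror other TRL configs

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # student; teacher defaults to the same checkpoint
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```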

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@qgallouedec

@Shekswess
Author

Shekswess commented Jan 31, 2026

@qgallouedec maybe this is not the perfect implementation, but overall I think it's okay. I followed the original code from the authors of the papers, and that was kinda messy hahahaha

Please feel free to drop any comments on how we can get this into the best shape possible (improvements, coverage, etc.) and I can help you on this one. This is my first PR like this, so I'm really excited. ❤️

P.S. I want this trainer to be added as an experimental trainer because I want to do active research on self-distillation methods for tiny language models; I see it as a possibility to make them even more powerful.

@jonhue

jonhue commented Feb 1, 2026

Hello @Shekswess, one of the authors here 👋
I had a quick glance over the code & it looks good. Thanks so much for making this PR!!

Currently, this implementation is for offline training (i.e. training on a fixed dataset of teacher prompts). I was wondering whether we could easily extend this implementation to online training too?
This is what we did in our other paper: https://github.com/lasgroup/SDPO. Unfortunately, our implementation for this is in verl, but it is 1-to-1 the same algorithm. The only difference is where the teacher prompts come from. In SDPO, the teacher prompts are created "online" using generated trajectories that are marked as correct by the environment and any other rich signal returned by the environment.
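
For readers who haven't seen SDPO: a hedged sketch of the online idea described above, where teacher prompts are built from the student's own rollouts that the environment verifies. All names and the prompt template below are placeholders, not part of TRL or SDPO.

```python
# Hedged sketch of the online variant: teacher prompts are constructed from the
# student's own rollouts that the environment marks as correct, then reused as
# an offline-style (prompt, teacher_prompt) dataset. Illustrative only; the
# callables are placeholders for the real rollout and verification logic.
from typing import Callable

def collect_online_teacher_prompts(
    prompts: list[str],
    generate: Callable[[str], str],      # student rollout, e.g. a vLLM call
    verify: Callable[[str, str], bool],  # environment check (tests, reward, ...)
) -> list[dict[str, str]]:
    pairs = []
    for prompt in prompts:
        completion = generate(prompt)
        if verify(prompt, completion):
            # Condition the teacher prompt on the verified trajectory.
            teacher_prompt = (
                f"{prompt}\n\nA known-correct solution:\n{completion}\n\nNow answer again."
            )
            pairs.append({"prompt": prompt, "teacher_prompt": teacher_prompt})
    return pairs
```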

Do you think it would make sense to integrate these into one implementation of self-distillation?

@Shekswess
Author

Heyoooo @jonhue!
First of all, really awesome job with this idea, the paper, and everything you guys have done in general with this approach. I love it and I cannot wait to test this on tiny language models. About the implementation: I think we can actually do an "online" version of this. Before starting to modify the code, I want to consult @qgallouedec, because this legend knows a lot about trl (as main contributor) and what the best way of implementing this in trl would be. I can do all the heavy work, so no worries on that front. The only help I would need is some advice on how to handle this as well as possible in trl. This would be huge for a lot of research folks, because the whole approach looks really promising to me and I cannot wait to get my hands dirty hahahaha.

@jonhue

jonhue commented Feb 2, 2026

Amazing! Happy to help!

@qgallouedec
Member

Hey, sorry for the late review, this one is quite big! Thank you!

At this point, I have a few remarks:

  • I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation; I think we should enforce that.
  • In my understanding, we only generate one completion from the teacher and one from the model, so it's fair to just hard-code num_generations=1 (see the sketch after this list).
  • In the paper they say, "If needed, add importance sampling to compensate for differences between the inference engine (e.g., VLLM) and the training code.", but they don't provide guidance on how they do it. I think for the first implementation we should just completely drop this IS correction and maybe add it later.
  • There have been many changes in GRPO during the last days/weeks, including tool-calling support, which seems to be very important in the paper. I'm trying to integrate all of them in this PR: [WIP] Integrate latest changes to SDFT (Shekswess/trl#1). It is a work in progress, not working, no need to review it yet.
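
For orientation, here is a toy sketch of the self-distillation step these remarks converge on: the same model acts as teacher and student, a single completion is generated per prompt (num_generations=1), and there is no importance-sampling correction. This is an illustration of the general idea under those assumptions, not the PR's actual training loop or loss.

```python
# Toy sketch only: self-distillation with one model and one completion per prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain self-distillation in one sentence."
teacher_prompt = "You are an expert teacher. " + prompt

# 1) Generate one completion from the teacher prompt with the current weights.
with torch.no_grad():
    teacher_ids = tokenizer(teacher_prompt, return_tensors="pt").input_ids
    completion_ids = model.generate(teacher_ids, max_new_tokens=64, do_sample=True)[
        :, teacher_ids.shape[1]:
    ]

# 2) Train the same model to produce that completion from the plain prompt.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # only the completion contributes to the loss

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```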

@perceptiveshawty

  • I think you should drop the separate ref_model/teacher_model. The motivation of the paper is around self-distillation, I think we should enforce that.

This makes sense if it's meant to serve as a reference implementation / reproduce results.

That said, it would be easy to just default to self-distillation when ref_model isn't explicitly provided. Then having the optionality would support research into effective teachers, differences between model families, etc.
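
A tiny sketch of that defaulting behaviour, just to illustrate the suggestion; the real SDFTTrainer signature may look different.

```python
# Illustrative only: fall back to self-distillation when no teacher is given.
def resolve_teacher(model, teacher_model=None):
    """Use the explicit teacher if provided, otherwise the student itself."""
    return teacher_model if teacher_model is not None else model
```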

