SeDi

This is the official code repository for the paper “Bridging the Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation”, accepted to AAAI 2026. We propose SeDi, a semantics- and distribution-aware knowledge transfer framework for cross-tokenizer distillation.

Experimental Environment Setup

You can install the required dependencies using either of the following methods:

pip install -r requirements.txt

or

conda env create -f environment.yml
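
After installation, it can help to confirm that the listed packages are actually present in the active environment. The sketch below is not part of this repository: `requirement_names` and `missing_requirements` are hypothetical helpers, and they assume `requirements.txt` uses ordinary `name==version` style lines. Only presence is checked, not the installed version, and the package names in the example string are placeholders.

```python
import re
from importlib import metadata


def requirement_names(text):
    """Extract distribution names from requirements-style text.

    Blank lines, comments, and pip options such as "-r other.txt" are skipped.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        # The name ends at the first version specifier, extra, or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(text):
    """Return the names from `text` that are not installed in this environment."""
    missing = []
    for name in requirement_names(text):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Placeholder contents, not the repository's actual requirements.
example = """
# core packages
package-a==1.0.0
package_b>=2.3
"""
print(requirement_names(example))
```

To check the real file, read `requirements.txt` and pass its contents to `missing_requirements`; an empty list means every listed distribution was found.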

Datasets

We evaluate our method on three tasks: instruction following, code generation, and math reasoning.

Task	Train	Test
Instruction Following	Dolly	Snist, Unist, Self-Inst, Vicuna
Code Generation	CodeM	HumanEval
Math Reasoning	MetaMath	Orca, GSM8K, Math

Fine-tuning Teacher and Student Models

If a fine-tuned teacher model is already available, you can use it directly by updating the relevant settings in the script file. Otherwise, we recommend fine-tuning one on your target dataset with: bash finetune_teacher.sh.

Change MODEL_TYPE to your teacher model type.
Change CKPT_PATH to the path of your pre-trained model.
If you use LoRA fine-tuning, add the following parameters:

--peft lora --peft-lora-r 256 --peft-lora-alpha 8 --peft-lora-dropout 0.1 \
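
To give a sense of what these values imply, the sketch below assumes the standard LoRA formulation, in which a frozen weight of shape (d_out, d_in) receives a low-rank update scaled by alpha / r, with factors of shape (d_out, r) and (r, d_in). Whether the repository's scripts apply exactly this scaling is an assumption, and the 4096 x 4096 layer size is a hypothetical example rather than a property of any particular model.

```python
def lora_scaling(r, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank factors: (d_out x r) plus (r x d_in)."""
    return r * (d_in + d_out)


r, alpha = 256, 8  # values from the flags above
d_in = d_out = 4096  # hypothetical layer size

print(lora_scaling(r, alpha))                   # 0.03125
print(lora_trainable_params(d_in, d_out, r))    # 2097152
print(d_in * d_out)                             # 16777216 (full weight matrix)
```

Under these assumptions, the adapter for one such layer trains one eighth as many parameters as the full matrix, and the update is scaled by 1/32.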

Similarly, we recommend fine-tuning the student model for 3 epochs before distillation using: bash finetune_student.sh.

As with the teacher, update MODEL_TYPE and CKPT_PATH in finetune_student.sh to match your student model.

Running Distillation

We support five baseline methods as well as our proposed SeDi method. You can run each method with the following scripts:

Method	Command
MinED	bash minedit.sh
CDM	bash cdm.sh
ULD	bash uld.sh
MultiLevelOT	bash multi_level_OT.sh
DSKD	bash dskd.sh
SeDi	bash sedi.sh
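
When running several methods back to back, a thin wrapper around the table above can save typing. The following is a sketch, not something shipped with the repository: `SCRIPTS`, `distill_command`, and `run_distillation` are hypothetical names, and the wrapper assumes it is launched from the directory that contains the `.sh` files.

```python
import subprocess

# Method name -> script, transcribed from the table above.
SCRIPTS = {
    "MinED": "minedit.sh",
    "CDM": "cdm.sh",
    "ULD": "uld.sh",
    "MultiLevelOT": "multi_level_OT.sh",
    "DSKD": "dskd.sh",
    "SeDi": "sedi.sh",
}


def distill_command(method):
    """Build the command line for one distillation method."""
    if method not in SCRIPTS:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(SCRIPTS)}")
    return ["bash", SCRIPTS[method]]


def run_distillation(method):
    """Run the script for `method`; raises CalledProcessError if it fails."""
    # Assumes the current working directory contains the script.
    return subprocess.run(distill_command(method), check=True)


print(distill_command("SeDi"))
```

Calling `run_distillation("SeDi")` would then be equivalent to typing `bash sedi.sh` in that directory.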


MaybeLizzy/SEDI
