This repository contains the official implementation for the ACL 2026 Main paper "MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models".
MoA builds a heterogeneous mixture over multiple adapter structures for parameter-efficient LLM fine-tuning. The repository provides:
- Soft MoA: softly fuses outputs from multiple adapter experts.
- Sparse MoA: activates adapter experts sparsely with small performance degradation.
- Baselines: LoRA, Prompt Tuning, Parallel Adapter, UniPEFT, HydraLoRA, MoLoRA, AdaMoLE, and Top-K MoE-LoRA variants.
- Two code paths:
  - LLaMA3_*: implementation based on the native LLaMA 3 code path.
  - MoA_Transformers/: implementation based on Hugging Face transformers, currently covering Qwen3 and LLaMA-style models.
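
To make the Soft/Sparse distinction in the list above concrete, here is a toy sketch in plain Python. It is not the repository's router: the sigmoid gate, the thresholding rule, and the names `soft_fuse` and `sparse_fuse` are illustrative assumptions chosen only to show the difference between fusing every expert and skipping low-weight experts.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_fuse(expert_outputs, router_logits):
    """Weighted sum over all expert outputs; every expert stays active."""
    weights = [sigmoid(z) for z in router_logits]  # assumed gate, for illustration
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


def sparse_fuse(expert_outputs, router_logits, threshold=0.5):
    """Same weighted sum, but experts whose weight is below `threshold` are skipped.

    Returns the fused vector and the number of experts that were active.
    """
    weights = [sigmoid(z) for z in router_logits]
    active = [(w, out) for w, out in zip(weights, expert_outputs) if w >= threshold]
    dim = len(expert_outputs[0])
    fused = [sum(w * out[i] for w, out in active) for i in range(dim)]
    return fused, len(active)


# Two hypothetical adapter experts producing 2-dimensional outputs.
outputs = [[1.0, 0.0], [0.0, 2.0]]
logits = [0.0, -10.0]
soft = soft_fuse(outputs, logits)
sparse, num_active = sparse_fuse(outputs, logits, threshold=0.5)
```

In this toy setting the second expert contributes almost nothing to the soft result and is dropped entirely in the sparse one, which is the kind of compute saving the sparse variant aims for.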
.
├── README.md
├── requirements.txt
├── environment.yml
├── datasets/
│   ├── math_14k/
│   ├── commonsense_15k/
│   └── math_commonsense/
├── LLaMA3_soft_moa/
├── LLaMA3_sparse_moa/
├── LLaMA3_soft_moa_instance/
├── LLaMA3_soft_moe/
├── LLaMA3_sparse_moe/
├── LLaMA3_UniPEFT/
└── MoA_Transformers/
The LLaMA3_* directories use the original LLaMA-style checkpoint layout and tokenizer file. They share a similar workflow:
- main_finetune.py: fine-tune adapters.
- extract_adapter_from_checkpoint.py: extract adapter-only weights from a training checkpoint.
- example.py: run generation with the base model plus the extracted adapter.
- evaluate_math.py / evaluate_commonsense.py: evaluate generated outputs.
Main directories:
| Directory | Purpose |
|---|---|
| LLaMA3_soft_moa/ | Soft MoA over heterogeneous adapter types, including LoRA, parallel adapter, and prompt modules. |
| LLaMA3_sparse_moa/ | Sparse MoA over heterogeneous adapter types, mainly LoRA and parallel adapter modules. |
| LLaMA3_soft_moa_instance/ | Instance-level Soft MoA variant. |
| LLaMA3_soft_moe/ | Soft MoE-style PEFT baselines, including LoRA, HydraLoRA, MoLoRA, prompt, and parallel adapter variants. |
| LLaMA3_sparse_moe/ | Sparse MoE-LoRA baselines, including AdaMoLE and Top-K MoE-LoRA. |
| LLaMA3_UniPEFT/ | UniPEFT-style baseline. |
Each directory contains public example scripts in exps/.
MoA_Transformers/ integrates MoA with a local modified copy of Hugging Face transformers.
Key files:
| Path | Purpose |
|---|---|
| MoA_Transformers/train.py | Fine-tuning entry point. Supports config files via @configs/.... |
| MoA_Transformers/test.py | Generic testing entry point for choice-style tasks. |
| MoA_Transformers/test_math.py | Math benchmark generation/testing entry point. |
| MoA_Transformers/evaluate_math.py | Math evaluation script. |
| MoA_Transformers/evaluate_code.py | Code evaluation script. |
| MoA_Transformers/src/ | PEFT wrapper, trainer, config, save/load, and MoA adapter code. |
| MoA_Transformers/transformers/ | Modified local transformers source tree. |
| MoA_Transformers/configs/ | Public training/testing configs for Qwen3-8B and Qwen3-14B Soft/Sparse MoA on math_14k. |
The modified model files are:
- MoA_Transformers/transformers/src/transformers/models/qwen3/modeling_qwen3.py
- MoA_Transformers/transformers/src/transformers/models/llama/modeling_llama.py
The repository also keeps alternate text copies such as modeling_qwen3_softmoa.py.txt and modeling_llama_softmoa.py.txt. Rename/copy these over the active modeling_*.py files only when you intentionally switch the model implementation variant.
cd /path/to/moe
pip install -r requirements.txt

For the MoA_Transformers/ implementation:
cd /path/to/moe/MoA_Transformers
pip install -r requirements.txt
cd transformers
pip install -e ".[torch]"

The repository includes task data under datasets/.
Training datasets:
- datasets/math_14k/
- datasets/commonsense_15k/
Evaluation datasets:
- Math (under datasets/math_commonsense/): AddSub, AQuA, gsm8k, MultiArith, SingleEq, SVAMP
- Commonsense (under datasets/math_commonsense/): boolq, piqa, social_i_qa, hellaswag, winogrande, ARC-Challenge, ARC-Easy, openbookqa
Most training files follow the Alpaca-style JSON format:
{
"instruction": "...",
"input": "...",
"output": "..."
}

Run commands from the specific LLaMA3_* directory.
Example: Soft MoA on math_14k.
cd LLaMA3_soft_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_math14k_generate_evaluate_seed.sh

Example: Sparse MoA on math_14k.
cd LLaMA3_sparse_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_math14k_generate_evaluate_seed.sh

Example: Soft MoA on commonsense_15k.
cd LLaMA3_soft_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_commonsense15k_generate_evaluate_seed.sh

The scripts use a path variable, commonly set to /home, to construct model, dataset, and output paths:
path="/home"

Before running, edit this variable so it points to your workspace root. The scripts expect paths like:
${path}/pretrain_models/Meta-Llama-3.1-8B-Instruct/
${path}/datasets/math_14k/train.json
${path}/datasets/math_commonsense/AddSub/test.json
${path}/outputs/...
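
Before launching a script, it can help to confirm that the layout above exists under your chosen root. The following is a minimal sketch, not part of the repository: the helper names `expected_paths` and `missing_paths` are hypothetical, and treating the outputs directory as optional is an assumption.

```python
from pathlib import Path


def expected_paths(root, eval_task="AddSub"):
    """Map a short label to each path the example scripts are described as using."""
    root = Path(root)
    return {
        "base_model": root / "pretrain_models" / "Meta-Llama-3.1-8B-Instruct",
        "train_data": root / "datasets" / "math_14k" / "train.json",
        "eval_data": root / "datasets" / "math_commonsense" / eval_task / "test.json",
        "outputs": root / "outputs",
    }


def missing_paths(root, eval_task="AddSub"):
    """Return the labels of required inputs that do not exist under `root`.

    The outputs directory is skipped here (assumption: it does not need to
    exist before training starts).
    """
    paths = expected_paths(root, eval_task)
    return [
        label
        for label, path in paths.items()
        if label != "outputs" and not path.exists()
    ]
```

Calling `missing_paths("/home")` returns an empty list when the model and both data files are in place, and otherwise names what still needs to be downloaded or moved.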
The public scripts perform the following steps:
- Train:
torchrun --nproc_per_node ${num_devices} main_finetune.py \
    --llama_path Meta-Llama-3.1-8B-Instruct/ \
    --data_path datasets/math_14k/train.json \
    --output_dir outputs/run_name/
- Extract adapter weights:
python extract_adapter_from_checkpoint.py \
    --checkpoint outputs/run_name/checkpoint-1.pth

This writes adapter files such as:
adapter.pth
adapter_params.json
- Generate predictions:
torchrun --nproc_per_node ${num_devices} example.py \
    --ckpt_dir Meta-Llama-3.1-8B-Instruct/ \
    --adapter_path outputs/run_name/adapter.pth \
    --data_path datasets/math_commonsense/AddSub/test.json \
    --save_path outputs/run_name/AddSub_predict_mingen120.jsonl \
    --max_gen_len 200 \
    --min_gen_len 120
- Evaluate:
python evaluate_math.py \
    --predict_file outputs/run_name/AddSub_predict_mingen120.jsonl

For commonsense tasks:
python evaluate_commonsense.py \
    --predict_file outputs/run_name/boolq_predict_mingen10.jsonl

The baseline directories use the same train/extract/generate/evaluate workflow.
Examples:
cd LLaMA3_soft_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_lora_math14k_generate_evaluate_seed.sh

cd LLaMA3_soft_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_hydralora_math14k_generate_evaluate_seed.sh

cd LLaMA3_sparse_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_adamole_lora_math14k_generate_evaluate_seed.sh

cd LLaMA3_sparse_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_topkmoe_lora_math14k_generate_evaluate_seed.sh

Install the local modified transformers first:
cd MoA_Transformers
pip install -r requirements.txt
cd transformers
pip install -e ".[torch]"
cd ..

Train Sparse MoA with Qwen3-8B on math_14k:
python train.py @configs/qwen3-8b_sparsemoa_math14k_train.config

Train Soft MoA with Qwen3-8B on math_14k:
python train.py @configs/qwen3-8b_softmoa_math14k_train.config

Test a trained Sparse MoA model:
python test.py @configs/qwen3-8b_sparsemoa_math14k_test.config

Evaluate math predictions:
python evaluate_math.py --predict_file /path/to/predictions/addsub_responses.jsonl

Config files in MoA_Transformers/configs/ are passed with Python argparse's fromfile_prefix_chars='@' syntax, where each line of the file is read as one command-line argument. For example:
--model_path=Qwen/Qwen3-8B
--data_path=/path/to/moe/datasets/math_14k
--peft_type=sparsemoa
--lora_rank=8
--target_modules
q_proj
k_proj
v_proj
o_proj
down_proj
--max_length=300
--batch_size=8
--gradient_accumulation_steps=2
--num_train_epochs=1
--learning_rate=1e-4
--lr_scheduler_type=constant_with_warmup
--warmup_steps=200
--weight_decay=0.0
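
The one-argument-per-line layout of the config above follows from how argparse expands `@file` arguments. The sketch below reproduces that mechanism with a small stand-in parser. It is not the parser defined in train.py: `build_parser` and `parse_config_text` are hypothetical helpers, and only a few of the flags from the config are declared.

```python
import argparse
import os
import tempfile


def build_parser() -> argparse.ArgumentParser:
    # fromfile_prefix_chars="@" makes argparse replace "@file" with the
    # arguments stored in that file, one argument per line.
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--model_path")
    parser.add_argument("--peft_type", choices=["softmoa", "sparsemoa"])
    parser.add_argument("--lora_rank", type=int)
    # nargs="+" collects the following bare lines (q_proj, k_proj, ...)
    # until the next line that starts with "--".
    parser.add_argument("--target_modules", nargs="+")
    parser.add_argument("--learning_rate", type=float)
    return parser


def parse_config_text(config_text, extra_args=()):
    """Write `config_text` to a temporary file and parse it via the @file syntax."""
    with tempfile.NamedTemporaryFile("w", suffix=".config", delete=False) as handle:
        handle.write(config_text)
    try:
        return build_parser().parse_args(["@" + handle.name, *extra_args])
    finally:
        os.remove(handle.name)


CONFIG = """\
--model_path=Qwen/Qwen3-8B
--peft_type=sparsemoa
--lora_rank=8
--target_modules
q_proj
k_proj
--learning_rate=1e-4
"""

args = parse_config_text(CONFIG)
overridden = parse_config_text(CONFIG, ["--lora_rank=16"])
```

Because argparse processes arguments left to right, a flag given on the command line after the `@file` argument replaces the value from the file, as the `overridden` example shows. Blank lines in a config file are read as empty arguments, so they are best avoided.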
Available public configs cover:
- qwen3-8b_softmoa_math14k_*
- qwen3-8b_sparsemoa_math14k_*
- qwen3-14b_softmoa_math14k_*
- qwen3-14b_sparsemoa_math14k_*
- seed variants with seed125 and seed1225
By default, train.py saves outputs under:
MoA_Transformers/outputs/<model>-<peft_type>-<dataset>/
For example:
outputs/qwen3-8b-sparsemoa-math-14k/
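
If you need to locate a run directory from a script, the naming pattern above can be approximated as follows. This is a guess reconstructed from the single example, not the logic in train.py: the function name `run_dir_name`, the lowercasing, and the underscore-to-hyphen replacement are all assumptions, so check the actual directory name after your first run.

```python
from pathlib import PurePosixPath


def run_dir_name(model_path: str, peft_type: str, data_path: str) -> str:
    """Approximate <model>-<peft_type>-<dataset> from the CLI arguments."""
    # Keep only the last path component, e.g. "Qwen/Qwen3-8B" -> "qwen3-8b".
    model = PurePosixPath(model_path).name.lower()
    # Assumed normalization: "math_14k" -> "math-14k".
    dataset = PurePosixPath(data_path).name.lower().replace("_", "-")
    return f"{model}-{peft_type}-{dataset}"


name = run_dir_name("Qwen/Qwen3-8B", "sparsemoa", "/path/to/moe/datasets/math_14k")
```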
Common native LLaMA 3 arguments:
| Argument | Meaning |
|---|---|
| --lora_layers | Layer range for LoRA modules, e.g. 0-32. |
| --lora_rank | LoRA rank. |
| --lora_targets | Target modules, e.g. Q,K,V,O,FFN_DOWN. |
| --lora_alpha | LoRA scaling alpha. |
| --p_adapter_layers | Layer range for parallel adapters. |
| --p_adapter_size | Hidden size of parallel adapters. |
| --prompt_layers | Layer range for prompt modules. |
| --prompt_len | Prompt length. |
| --expert_num | Number of experts for MoE-style baselines. |
| --swi_x | Router hidden-size multiplier for SwiGLU router; 0 uses a linear router. |
| --max_threshold | Sparse MoA activation threshold. |
| --batch_size | Per-GPU batch size. |
| --accum_iter | Gradient accumulation steps. |
| --bf16 | Use bfloat16. |
| --flash_attention2 | Enable FlashAttention 2 when available. |
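
The layer-range arguments in the table take a START-END string such as 0-32. The sketch below shows one plausible reading of that format. It is not taken from the repository: `parse_layer_range` is a hypothetical helper, and the default of treating the end as exclusive is an assumption, motivated by the fact that Llama 3.1 8B has 32 transformer layers (indices 0 to 31). Check the argument parsing in main_finetune.py before relying on it.

```python
def parse_layer_range(spec: str, end_inclusive: bool = False) -> list:
    """Turn a "START-END" string into a list of layer indices."""
    start_text, end_text = spec.split("-", 1)
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:
        raise ValueError(f"invalid layer range: {spec!r}")
    # Assumed default: END is exclusive, so "0-32" covers layers 0..31.
    stop = end + 1 if end_inclusive else end
    return list(range(start, stop))


all_layers = parse_layer_range("0-32")
upper_half = parse_layer_range("16-32")
```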
Common MoA_Transformers arguments:
| Argument | Meaning |
|---|---|
| --model_path | HF model id or local HF model path. |
| --data_path | HF dataset id or local dataset directory. |
| --peft_type | softmoa or sparsemoa. |
| --target_modules | Transformer module names to receive adapters. |
| --lora_rank | Adapter rank. |
| --max_length | Maximum tokenized sequence length. |
| --batch_size | Per-device batch size. |
| --gradient_accumulation_steps | Gradient accumulation steps. |
| --num_train_epochs | Number of training epochs. |
| --learning_rate | Training learning rate. |
| --seed | Random seed. |
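
Both code paths expose a per-device batch size and a gradient accumulation count, so the number of examples contributing to each optimizer step is their product times the number of devices. A small sketch of that arithmetic, assuming standard data-parallel training (the helper name `effective_batch_size` is hypothetical):

```python
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Examples consumed per optimizer step under data-parallel training."""
    if min(per_device_batch, accumulation_steps, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * accumulation_steps * num_devices


# Values from the example config: --batch_size=8, --gradient_accumulation_steps=2.
single_gpu = effective_batch_size(8, 2)
four_gpus = effective_batch_size(8, 2, num_devices=4)
```

With the example config this gives 16 examples per step on one device. When changing the number of GPUs, adjusting the accumulation steps keeps this product constant.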
If you find MoA useful in your projects, please consider citing our paper:
@article{cao2025moa,
title={MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models},
author={Cao, Jie and Lin, Tianwei and He, Hongyang and Yan, Rolan and Zhang, Wenqiao and Li, Juncheng and Zhang, Dongping and Tang, Siliang and Zhuang, Yueting},
journal={arXiv preprint arXiv:2506.05928},
year={2025}
}