MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

This repository contains the official implementation for the ACL 2026 Main paper "MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models".

MoA builds a heterogeneous mixture over multiple adapter structures for parameter-efficient LLM fine-tuning. The repository provides:

- Soft MoA: softly fuses the outputs of multiple adapter experts.
- Sparse MoA: activates adapter experts sparsely, with only a small loss in performance.
- Baselines: LoRA, Prompt Tuning, Parallel Adapter, UniPEFT, HydraLoRA, MoLoRA, AdaMoLE, and Top-K MoE-LoRA variants.
- Two code paths:
  - LLaMA3_*: implementation based on the native LLaMA 3 code path.
  - MoA_Transformers/: implementation based on Hugging Face transformers, currently covering Qwen3 and LLaMA-style models.
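
To make the soft/sparse distinction concrete, here is a generic sketch of the two fusion styles in plain Python. It is not the repository's implementation: the softmax router, the lack of renormalization, and the fixed `threshold` rule are all assumptions chosen for illustration, and the paper's actual routing may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_fuse(expert_outputs, router_scores):
    """Soft mixture: every expert contributes, weighted by the router."""
    weights = softmax(router_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


def sparse_fuse(expert_outputs, router_scores, threshold):
    """Sparse mixture: experts whose weight is below `threshold` are skipped.

    Assumption: the surviving weights are used as-is (no renormalization).
    Returns the fused vector and the number of experts that were evaluated.
    """
    weights = softmax(router_scores)
    active = [(w, out) for w, out in zip(weights, expert_outputs) if w >= threshold]
    dim = len(expert_outputs[0])
    fused = [sum(w * out[i] for w, out in active) for i in range(dim)]
    return fused, len(active)


# Three hypothetical adapter experts, each producing a 2-dimensional output.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, -2.0]

print(soft_fuse(outputs, scores))
print(sparse_fuse(outputs, scores, threshold=0.1))
```

In the sparse case the skipped experts never need to be computed at all, which is where the efficiency gain would come from.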

Repository Layout

.
├── README.md
├── requirements.txt
├── environment.yml
├── datasets/
│   ├── math_14k/
│   ├── commonsense_15k/
│   └── math_commonsense/
├── LLaMA3_soft_moa/
├── LLaMA3_sparse_moa/
├── LLaMA3_soft_moa_instance/
├── LLaMA3_soft_moe/
├── LLaMA3_sparse_moe/
├── LLaMA3_UniPEFT/
└── MoA_Transformers/

Implementations

Native LLaMA 3 Implementation

The LLaMA3_* directories use the original LLaMA-style checkpoint layout and tokenizer file. They share a similar workflow:

- main_finetune.py: fine-tune adapters.
- extract_adapter_from_checkpoint.py: extract adapter-only weights from a training checkpoint.
- example.py: run generation with the base model plus the extracted adapter.
- evaluate_math.py / evaluate_commonsense.py: evaluate generated outputs.

Main directories:

| Directory | Purpose |
| --- | --- |
| `LLaMA3_soft_moa/` | Soft MoA over heterogeneous adapter types, including LoRA, parallel adapter, and prompt modules. |
| `LLaMA3_sparse_moa/` | Sparse MoA over heterogeneous adapter types, mainly LoRA and parallel adapter modules. |
| `LLaMA3_soft_moa_instance/` | Instance-level Soft MoA variant. |
| `LLaMA3_soft_moe/` | Soft MoE-style PEFT baselines, including LoRA, HydraLoRA, MoLoRA, prompt, and parallel adapter variants. |
| `LLaMA3_sparse_moe/` | Sparse MoE-LoRA baselines, including AdaMoLE and Top-K MoE-LoRA. |
| `LLaMA3_UniPEFT/` | UniPEFT-style baseline. |

Each directory contains public example scripts in exps/.

Transformers Implementation

MoA_Transformers/ integrates MoA with a locally modified copy of Hugging Face transformers.

Key files:

| Path | Purpose |
| --- | --- |
| `MoA_Transformers/train.py` | Fine-tuning entry point. Supports config files via `@configs/...`. |
| `MoA_Transformers/test.py` | Generic testing entry point for choice-style tasks. |
| `MoA_Transformers/test_math.py` | Math benchmark generation/testing entry point. |
| `MoA_Transformers/evaluate_math.py` | Math evaluation script. |
| `MoA_Transformers/evaluate_code.py` | Code evaluation script. |
| `MoA_Transformers/src/` | PEFT wrapper, trainer, config, save/load, and MoA adapter code. |
| `MoA_Transformers/transformers/` | Modified local `transformers` source tree. |
| `MoA_Transformers/configs/` | Public training/testing configs for Qwen3-8B and Qwen3-14B Soft/Sparse MoA on `math_14k`. |

The modified model files are:

MoA_Transformers/transformers/src/transformers/models/qwen3/modeling_qwen3.py
MoA_Transformers/transformers/src/transformers/models/llama/modeling_llama.py

The repository also keeps alternate text copies such as modeling_qwen3_softmoa.py.txt and modeling_llama_softmoa.py.txt. To switch the model implementation variant, copy the relevant file over the corresponding active modeling_*.py file; otherwise leave the active files untouched.

Installation

pip

cd /path/to/moe
pip install -r requirements.txt

For the MoA_Transformers/ implementation:

cd /path/to/moe/MoA_Transformers
pip install -r requirements.txt
cd transformers
pip install -e ".[torch]"

Data

The repository includes task data under datasets/.

Training datasets:

datasets/math_14k/
datasets/commonsense_15k/

Evaluation datasets:

Math: datasets/math_commonsense/AddSub, AQuA, gsm8k, MultiArith, SingleEq, SVAMP
Commonsense: datasets/math_commonsense/boolq, piqa, social_i_qa, hellaswag, winogrande, ARC-Challenge, ARC-Easy, openbookqa

Most training files follow the Alpaca-style JSON format:

{
  "instruction": "...",
  "input": "...",
  "output": "..."
}
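
As a quick sanity check before training, a record in this format can be validated with a few lines of Python. This is a minimal sketch: `validate_record` and `build_prompt` are hypothetical helpers, and the prompt layout in particular is an assumption, since the template the training scripts actually use is not shown here.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_record(record):
    """Raise ValueError if an Alpaca-style record lacks a required key."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"record is missing keys: {missing}")
    return record


def build_prompt(record):
    """Assemble a prompt string (hypothetical layout, not the repo's template)."""
    if record["input"]:
        return (
            f"Instruction: {record['instruction']}\n"
            f"Input: {record['input']}\n"
            "Response:"
        )
    return f"Instruction: {record['instruction']}\nResponse:"


# A made-up record, used only to exercise the helpers.
sample = json.loads(
    '{"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}'
)
validate_record(sample)
print(build_prompt(sample))
```

Running the same check over every record in a training file catches missing fields before a long fine-tuning job starts.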

Running Native LLaMA 3 Experiments

Run commands from the specific LLaMA3_* directory.

Example: Soft MoA on math_14k.

cd LLaMA3_soft_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_math14k_generate_evaluate_seed.sh

Example: Sparse MoA on math_14k.

cd LLaMA3_sparse_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_math14k_generate_evaluate_seed.sh

Example: Soft MoA on commonsense_15k.

cd LLaMA3_soft_moa
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_all_commonsense15k_generate_evaluate_seed.sh

The scripts use a path variable, commonly set to /home, to construct model, dataset, and output paths:

path="/home"

Before running, edit this variable so it points to your workspace root. The scripts expect paths like:

${path}/pretrain_models/Meta-Llama-3.1-8B-Instruct/
${path}/datasets/math_14k/train.json
${path}/datasets/math_commonsense/AddSub/test.json
${path}/outputs/...
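
A small preflight check can confirm that the workspace root is laid out as the scripts expect. The sketch below only mirrors the example paths listed above; `expected_paths` and `missing_paths` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def expected_paths(root):
    """Map a short name to each path the example scripts are said to expect."""
    root = Path(root)
    return {
        "model": root / "pretrain_models" / "Meta-Llama-3.1-8B-Instruct",
        "train": root / "datasets" / "math_14k" / "train.json",
        "test": root / "datasets" / "math_commonsense" / "AddSub" / "test.json",
        "outputs": root / "outputs",
    }


def missing_paths(root):
    """Return the names of required inputs that do not exist under `root`.

    The outputs directory is skipped because it may not exist before the
    first run.
    """
    return [
        name
        for name, path in expected_paths(root).items()
        if name != "outputs" and not path.exists()
    ]


# "/home" is the default mentioned above; replace it with your workspace root.
print(missing_paths("/home"))
```

An empty list means the model directory, training file, and test file were all found.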

Native Workflow Details

The public scripts perform the following steps:

Train:

torchrun --nproc_per_node ${num_devices} main_finetune.py \
  --llama_path Meta-Llama-3.1-8B-Instruct/ \
  --data_path datasets/math_14k/train.json \
  --output_dir outputs/run_name/

Extract adapter weights:

python extract_adapter_from_checkpoint.py \
  --checkpoint outputs/run_name/checkpoint-1.pth

This writes adapter files such as:

adapter.pth
adapter_params.json

Generate predictions:

torchrun --nproc_per_node ${num_devices} example.py \
  --ckpt_dir Meta-Llama-3.1-8B-Instruct/ \
  --adapter_path outputs/run_name/adapter.pth \
  --data_path datasets/math_commonsense/AddSub/test.json \
  --save_path outputs/run_name/AddSub_predict_mingen120.jsonl \
  --max_gen_len 200 \
  --min_gen_len 120

Evaluate:

python evaluate_math.py \
  --predict_file outputs/run_name/AddSub_predict_mingen120.jsonl

For commonsense tasks:

python evaluate_commonsense.py \
  --predict_file outputs/run_name/boolq_predict_mingen10.jsonl
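
The four steps can be strung together in a short dry-run script that prints the commands instead of executing them. This is a sketch under stated assumptions: it uses only the flags shown above, whereas the public scripts in exps/ likely pass additional hyperparameters, and `build_commands` is a hypothetical helper.

```python
import shlex


def build_commands(model_dir, run_dir, task, num_devices=1,
                   max_gen_len=200, min_gen_len=120):
    """Return the train/extract/generate/evaluate commands as argument lists."""
    # The prediction file name follows the pattern used in the examples above.
    predict_file = f"{run_dir}/{task}_predict_mingen{min_gen_len}.jsonl"
    train = [
        "torchrun", "--nproc_per_node", str(num_devices), "main_finetune.py",
        "--llama_path", model_dir,
        "--data_path", "datasets/math_14k/train.json",
        "--output_dir", run_dir,
    ]
    extract = [
        "python", "extract_adapter_from_checkpoint.py",
        "--checkpoint", f"{run_dir}/checkpoint-1.pth",
    ]
    generate = [
        "torchrun", "--nproc_per_node", str(num_devices), "example.py",
        "--ckpt_dir", model_dir,
        "--adapter_path", f"{run_dir}/adapter.pth",
        "--data_path", f"datasets/math_commonsense/{task}/test.json",
        "--save_path", predict_file,
        "--max_gen_len", str(max_gen_len),
        "--min_gen_len", str(min_gen_len),
    ]
    evaluate = ["python", "evaluate_math.py", "--predict_file", predict_file]
    return [train, extract, generate, evaluate]


# Dry run: print each command rather than launching it.
for command in build_commands("Meta-Llama-3.1-8B-Instruct/", "outputs/run_name", "AddSub"):
    print(shlex.join(command))
```

Building the prediction path once and reusing it keeps the generate and evaluate steps pointed at the same file.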

Running Baselines

The baseline directories use the same train/extract/generate/evaluate workflow.

Examples:

cd LLaMA3_soft_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_lora_math14k_generate_evaluate_seed.sh

cd LLaMA3_soft_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_hydralora_math14k_generate_evaluate_seed.sh

cd LLaMA3_sparse_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_adamole_lora_math14k_generate_evaluate_seed.sh

cd LLaMA3_sparse_moe
export CUDA_VISIBLE_DEVICES=0
bash exps/finetuning_llama3-1_topkmoe_lora_math14k_generate_evaluate_seed.sh

Running MoA_Transformers

Install the locally modified transformers first:

cd MoA_Transformers
pip install -r requirements.txt
cd transformers
pip install -e ".[torch]"
cd ..

Train Sparse MoA with Qwen3-8B on math_14k:

python train.py @configs/qwen3-8b_sparsemoa_math14k_train.config

Train Soft MoA with Qwen3-8B on math_14k:

python train.py @configs/qwen3-8b_softmoa_math14k_train.config

Test a trained Sparse MoA model:

python test.py @configs/qwen3-8b_sparsemoa_math14k_test.config

Evaluate math predictions:

python evaluate_math.py --predict_file /path/to/predictions/addsub_responses.jsonl

Config Files

Config files in MoA_Transformers/configs/ are passed with Python argparse's fromfile_prefix_chars='@' syntax. For example:

--model_path=Qwen/Qwen3-8B
--data_path=/path/to/moe/datasets/math_14k
--peft_type=sparsemoa
--lora_rank=8
--target_modules
q_proj
k_proj
v_proj
o_proj
down_proj
--max_length=300
--batch_size=8
--gradient_accumulation_steps=2
--num_train_epochs=1
--learning_rate=1e-4
--lr_scheduler_type=constant_with_warmup
--warmup_steps=200
--weight_decay=0.0
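
The one-argument-per-line layout follows from how argparse reads `@` files: each line becomes a single command-line token, so a multi-value option such as `--target_modules` lists its values on separate lines. The sketch below demonstrates this with the standard library only. The argument names are taken from the example config, but their types and `nargs` settings are assumptions about how train.py might declare them.

```python
import argparse
import os
import tempfile


def build_parser():
    """Parser that accepts @file arguments (types below are assumptions)."""
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("--model_path")
    parser.add_argument("--peft_type")
    parser.add_argument("--lora_rank", type=int)
    parser.add_argument("--target_modules", nargs="+")
    parser.add_argument("--max_length", type=int)
    parser.add_argument("--learning_rate", type=float)
    return parser


CONFIG_TEXT = """--model_path=Qwen/Qwen3-8B
--peft_type=sparsemoa
--lora_rank=8
--target_modules
q_proj
k_proj
v_proj
o_proj
down_proj
--max_length=300
--learning_rate=1e-4
"""


def parse_config_text(text):
    """Write `text` to a temporary config file and parse it via the @ prefix."""
    with tempfile.NamedTemporaryFile("w", suffix=".config", delete=False) as handle:
        handle.write(text)
        config_path = handle.name
    try:
        return build_parser().parse_args(["@" + config_path])
    finally:
        os.remove(config_path)


args = parse_config_text(CONFIG_TEXT)
print(args.target_modules)
print(args.lora_rank, args.learning_rate)
```

Because every line is one token, a value containing spaces must stay on its own line, and a line cannot hold both a flag and a space-separated value (use `--flag=value` instead).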

Available public configs cover:

qwen3-8b_softmoa_math14k_*
qwen3-8b_sparsemoa_math14k_*
qwen3-14b_softmoa_math14k_*
qwen3-14b_sparsemoa_math14k_*
seed variants with seed125 and seed1225

By default, train.py saves outputs under:

MoA_Transformers/outputs/<model>-<peft_type>-<dataset>/

For example:

outputs/qwen3-8b-sparsemoa-math-14k/

Important Hyperparameters

Common native LLaMA 3 arguments:

| Argument | Meaning |
| --- | --- |
| `--lora_layers` | Layer range for LoRA modules, e.g. `0-32`. |
| `--lora_rank` | LoRA rank. |
| `--lora_targets` | Target modules, e.g. `Q,K,V,O,FFN_DOWN`. |
| `--lora_alpha` | LoRA scaling alpha. |
| `--p_adapter_layers` | Layer range for parallel adapters. |
| `--p_adapter_size` | Hidden size of parallel adapters. |
| `--prompt_layers` | Layer range for prompt modules. |
| `--prompt_len` | Prompt length. |
| `--expert_num` | Number of experts for MoE-style baselines. |
| `--swi_x` | Router hidden-size multiplier for SwiGLU router; `0` uses a linear router. |
| `--max_threshold` | Sparse MoA activation threshold. |
| `--batch_size` | Per-GPU batch size. |
| `--accum_iter` | Gradient accumulation steps. |
| `--bf16` | Use bfloat16. |
| `--flash_attention2` | Enable FlashAttention 2 when available. |

Common MoA_Transformers arguments:

| Argument | Meaning |
| --- | --- |
| `--model_path` | HF model id or local HF model path. |
| `--data_path` | HF dataset id or local dataset directory. |
| `--peft_type` | `softmoa` or `sparsemoa`. |
| `--target_modules` | Transformer module names to receive adapters. |
| `--lora_rank` | Adapter rank. |
| `--max_length` | Maximum tokenized sequence length. |
| `--batch_size` | Per-device batch size. |
| `--gradient_accumulation_steps` | Gradient accumulation steps. |
| `--num_train_epochs` | Number of training epochs. |
| `--learning_rate` | Training learning rate. |
| `--seed` | Random seed. |
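
Since the batch size is per device and gradients are accumulated over several steps, the number of examples per optimizer update is the product of the per-device batch size, the accumulation steps, and the device count. A minimal sketch, assuming standard data-parallel training (`effective_batch_size` is a hypothetical helper):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps,
                         num_devices=1):
    """Examples consumed per optimizer update under data-parallel training."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the example config: --batch_size=8, --gradient_accumulation_steps=2.
print(effective_batch_size(8, 2))                 # 16 on a single device
print(effective_batch_size(8, 2, num_devices=4))  # 64 on four devices
```

When changing the number of GPUs, adjusting the accumulation steps in the opposite direction keeps this product, and therefore the optimization behavior, roughly constant.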

Citation

If you find MoA useful in your projects, please consider citing our paper:

@article{cao2025moa,
  title={MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models},
  author={Cao, Jie and Lin, Tianwei and He, Hongyang and Yan, Rolan and Zhang, Wenqiao and Li, Juncheng and Zhang, Dongping and Tang, Siliang and Zhuang, Yueting},
  journal={arXiv preprint arXiv:2506.05928},
  year={2025}
}
