$A^2Search$: Ambiguity-Aware Question Answering with Reinforcement Learning

📄 Paper · 🤗 Model Weights

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have greatly improved open-domain Question Answering (QA). However, existing approaches still struggle with ambiguous questions that admit multiple valid answers. Standard QA benchmarks—built under the assumption of a single gold answer—produce misleading training signals and fail to reflect this reality.


We introduce $A^2Search$, an annotation-free, end-to-end training framework that detects and handles ambiguity automatically. Our pipeline works by:

  1. Identifying ambiguous questions
  2. Collecting alternative answers through trajectory sampling and evidence verification
  3. Optimizing with RL using the $\mathrm{AnsF1}$ reward, which naturally accommodates multiple valid answers
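As a rough illustration of step 3 (a minimal sketch, not the paper's exact implementation), the $\mathrm{AnsF1}$ reward can be thought of as a set-level F1 between the model's predicted answers and the reference answers, where a prediction counts as correct if it matches a reference answer or one of its aliases:

```python
def ans_f1(predictions, references):
    """Set-level F1 between predicted answers and reference answer groups.

    Each reference is a dict like {"answer": ..., "aliases": [...]} (the
    format used in this repo's dataset files). A prediction matches a
    reference if it equals the answer or any alias after lowercasing.
    Illustrative sketch only.
    """
    def norm(s):
        return s.strip().lower()

    preds = {norm(p) for p in predictions}
    matched_refs = 0        # references covered by some prediction
    matched_preds = set()   # predictions that matched some reference
    for ref in references:
        valid = {norm(ref["answer"])} | {norm(a) for a in ref.get("aliases", [])}
        hit = preds & valid
        if hit:
            matched_refs += 1
            matched_preds |= hit
    precision = len(matched_preds) / len(preds) if preds else 0.0
    recall = matched_refs / len(references) if references else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because recall is taken over all reference answers, a single rollout that returns every valid answer scores 1.0, while a rollout that commits to only one of several valid answers is partially rewarded rather than penalized outright.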

The figure above shows an ambiguous question from MuSiQue.

  • ReSearch-32B produces different answers across rollouts—some diverging from the reference but still evidence-supported.
  • $A^2Search$ instead resolves ambiguity explicitly by retrieving multiple valid answers within a single rollout.

🔑 Key Results

The table above reports results on four multi-hop QA benchmarks under the Exact Match metric.

  • We measure $\mathrm{AnsF1}/\mathrm{Recall}@k$ with $k$ rollouts.

  • For $A^2Search$, only $@1$ is shown, since it can produce multiple answers within a single rollout.

  • For other baselines, where each rollout yields only one answer (so $\mathrm{AnsF1}@1=\mathrm{Recall}@1$), we additionally include $@3$ results to evaluate performance with more rollouts.

  • The best result in each group is highlighted in bold, and the second best is underlined.

  • On eight open-domain QA benchmarks, $A^2Search$ achieves state-of-the-art performance.

  • With only one rollout, $A^2Search$-7B reaches an average $\mathrm{AnsF1}@1$ score of 48.4% on four multi-hop benchmarks, outperforming even much larger models like ReSearch-32B (46.2%).

  • Extensive analyses confirm that $A^2Search$ not only resolves ambiguity but also generalizes across benchmarks.

👉 Embracing ambiguity is essential for building reliable next-generation QA systems. Please find more experimental results and details in our paper.
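For intuition, $\mathrm{Recall}@k$ over rollouts can be sketched as the fraction of reference answers produced by at least one of the first $k$ rollouts. This is a simplified version using exact string match; the actual metric also accounts for answer aliases:

```python
def recall_at_k(rollout_answers, references, k):
    """Fraction of reference answer strings produced by at least one of the
    first k rollouts.

    rollout_answers: list of per-rollout answer lists.
    references: list of reference answer strings.
    Simplified sketch: lowercase exact match, no alias handling.
    """
    pool = {a.strip().lower() for rollout in rollout_answers[:k] for a in rollout}
    hits = sum(1 for r in references if r.strip().lower() in pool)
    return hits / len(references) if references else 0.0
```

This makes the comparison in the table concrete: a single-answer baseline needs multiple rollouts to raise its recall, whereas a model that emits several valid answers in one rollout can already cover them at $k=1$.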

🚀 Installation

1. Setup Environment

Create and activate a Python virtual environment:

python3 -m venv venv
source venv/bin/activate

Install PyTorch and FlashAttention:

pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

2. Install Dependencies

Clone this repo and install requirements:

git clone <this-repo-url>
cd <this-repo-name>
pip install -e .

📚 Data

All training and evaluation datasets are available in the ./dataset directory:

  • Training:

    a2search_musique_2wiki_nq.parquet – constructed using our evidence-verification pipeline.

  • Development:

    musique_random_512_dev.parquet – used for hyperparameter tuning and checkpoint selection.

  • Evaluation:

    Other benchmark datasets are included for evaluation.

  • Data Structure

    All datasets are stored in Parquet format and follow the same JSON structure:

    {
      "data_source": "", // dataset name
      "question": "", // the target question
      "reward_model": {
        "ground_truth": [
          { "aliases": [], "answer": "" } // reference answers and their aliases
        ]
      },
      "extra_info": {
        "id": "" // question id
      }
    }

For retrieval, we reuse the wiki index and retriever from Search-R1. Make sure you have it running locally at:

http://127.0.0.1:80
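A minimal way to verify the retriever is reachable before launching training (this only checks that the port accepts connections; the actual API endpoints are defined by Search-R1):

```python
import socket

def retriever_is_up(host="127.0.0.1", port=80, timeout=2.0):
    """Return True if a TCP connection to the retrieval service succeeds.

    Only a reachability probe; it does not validate the Search-R1 API itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. retriever_is_up() before starting a training run
```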

🏋️ RL Training

We train using the Verl framework. Training code is under the verl/ directory.

Example scripts:

  • Single-node debugging: scripts/train_debug.sh
  • Multi-node training: scripts/train_multinode.sh

⚠️ These scripts provide basic startup commands. For detailed hyperparameter settings, please refer to our paper.


📊 Evaluation

We provide an evaluation script for running experiments across benchmarks:

cd evaluation
bash run_evaluation.sh

You can configure:

  • Model checkpoint
  • Dataset
  • Temperature
  • Number of sampled rollouts

Before running evaluation, please check:

  • evaluation/config.py → correct model path
  • evaluation/lmjudge_agent.py → correct API path and key for the LLM judge module

📝 Citation

If you find this work useful, please cite our paper:

@misc{zhang2025a2searchambiguityawarequestionanswering,
      title={A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning}, 
      author={Fengji Zhang and Xinyao Niu and Chengyang Ying and Guancheng Lin and Zhongkai Hao and Zhou Fan and Chengen Huang and Jacky Keung and Bei Chen and Junyang Lin},
      year={2025},
      eprint={2510.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.07958}, 
}
