Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have greatly improved open-domain Question Answering (QA). However, existing approaches still struggle with ambiguous questions that admit multiple valid answers. Standard QA benchmarks—built under the assumption of a single gold answer—produce misleading training signals and fail to reflect this reality.
We introduce $A^2Search$, an ambiguity-aware RL training framework that works by:
- Identifying ambiguous questions,
- Collecting alternative answers through trajectory sampling and evidence verification, and
- Optimizing with RL using the $\mathrm{AnsF1}$ reward, which naturally accommodates multiple valid answers (see the sketch below).
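For intuition, here is a minimal, purely illustrative sketch of what a set-level answer F1 over multiple references could look like, using exact string matching against alias sets (the same structure as our data format further down). The actual $\mathrm{AnsF1}$ reward is defined precisely in the paper.

```python
# Purely illustrative sketch of a set-level answer F1 ("AnsF1") using exact
# string matching against alias sets; the paper defines the actual reward.
def normalize(text):
    """Lowercase and collapse whitespace for exact-match comparison."""
    return " ".join(text.lower().split())

def ans_f1(predictions, references):
    """predictions: list of predicted answer strings.
    references: one list of acceptable aliases per gold answer."""
    preds = {normalize(p) for p in predictions}
    golds = [{normalize(a) for a in alias_set} for alias_set in references]
    if not preds or not golds:
        return 0.0
    matched_preds = sum(1 for p in preds if any(p in g for g in golds))
    matched_golds = sum(1 for g in golds if g & preds)
    precision = matched_preds / len(preds)
    recall = matched_golds / len(golds)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two predictions is correct, one of two references is covered -> 0.5
print(ans_f1(["Paris", "Lyon"], [["Paris"], ["Marseille"]]))
```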
The figure above shows an ambiguous question from MuSiQue.
- ReSearch-32B produces different answers across rollouts; some diverge from the reference but are still evidence-supported.
- $A^2Search$ instead resolves ambiguity explicitly by retrieving multiple valid answers within a single rollout.
The table above reports results on four multi-hop QA benchmarks under the Exact Match metric.
- We measure $\mathrm{AnsF1}/\mathrm{Recall}@k$ with $k$ rollouts.
- For $A^2Search$, only $@1$ is shown, since it can produce multiple answers within a single rollout.
- For other baselines, where each rollout yields only one answer (so $\mathrm{AnsF1}@1 = \mathrm{Recall}@1$), we additionally include $@3$ results to evaluate performance with more rollouts.
- The best result in each group is highlighted in bold, and the second best is underlined.
- On eight open-domain QA benchmarks, $A^2Search$ achieves state-of-the-art performance.
- With only one rollout, $A^2Search$-7B reaches an average $\mathrm{AnsF1}@1$ score of 48.4% on four multi-hop benchmarks, outperforming even much larger models like ReSearch-32B (46.2%).
- Extensive analyses confirm that $A^2Search$ not only resolves ambiguity but also generalizes across benchmarks.
👉 Embracing ambiguity is essential for building reliable next-generation QA systems. Please find more experimental results and details in our paper.
Create and activate a Python virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

Install PyTorch and FlashAttention:

```bash
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
```

Clone this repo and install requirements:

```bash
git clone <this-repo-url>
cd <this-repo-name>
pip install -e .
```

All training and evaluation datasets are available in the `./dataset` directory:
- Training: `a2search_musique_2wiki_nq.parquet` – constructed using our evidence-verification pipeline.
- Development: `musique_random_512_dev.parquet` – used for hyperparameter tuning and checkpoint selection.
- Evaluation: the other benchmark datasets are included for evaluation.
Data Structure

Our datasets are all organized in `parquet` format and follow the same JSON structure shown below:

```json
{
  "data_source": "",                    // dataset name
  "question": "",                       // the target question
  "reward_model": {
    "ground_truth": [
      { "aliases": [], "answer": "" }   // reference answers and their aliases
    ]
  },
  "extra_info": {
    "id": ""                            // question id
  }
}
```
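As a small usage sketch (assuming `pandas` with a parquet engine such as `pyarrow` is installed; nested fields may deserialize as dicts/lists or as JSON strings depending on how the file was written), a record can be inspected like this:

```python
# Minimal sketch for inspecting one training record; field names follow the
# JSON structure above.
import pandas as pd

df = pd.read_parquet("dataset/a2search_musique_2wiki_nq.parquet")
row = df.iloc[0]

print(row["data_source"])   # dataset name
print(row["question"])      # the target question

# reward_model.ground_truth holds the reference answers and their aliases;
# adjust parsing if the field is stored as a JSON string rather than a struct.
ground_truth = row["reward_model"]["ground_truth"]
for ref in ground_truth:
    print(ref["answer"], ref["aliases"])
```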
For retrieval, we reuse the wiki index and retriever from Search-R1. Make sure you have it running locally at:
http://127.0.0.1:80
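The exact request/response schema is defined by the Search-R1 retrieval server, so please verify it against that repo; assuming it exposes its usual `/retrieve` endpoint at the address above, a query might look roughly like this:

```python
# Rough sketch of querying the local retriever; the endpoint name and payload
# fields are assumptions based on the Search-R1 retrieval server.
import requests

payload = {
    "queries": ["Who wrote the novel that inspired the film Blade Runner?"],
    "topk": 3,
}
resp = requests.post("http://127.0.0.1:80/retrieve", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # top-k wiki passages for each query
```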
We train using the Verl framework.
Training code is under the verl/ directory.
Example scripts:
- Single-node debugging: `scripts/train_debug.sh`
- Multi-node training: `scripts/train_multinode.sh`
⚠️ These scripts provide basic startup commands. For detailed hyperparameter settings, please refer to our paper.
We provide an evaluation script for running experiments across benchmarks:
```bash
cd evaluation
bash run_evaluation.sh
```

You can configure:
- Model checkpoint
- Dataset
- Temperature
- Sampling numbers
Before running evaluation, please check:
- `evaluation/config.py` → correct model path
- `evaluation/lmjudge_agent.py` → correct API path and key for the LLM judge module
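Purely as a hypothetical illustration (the real option names live in `evaluation/config.py` and `evaluation/lmjudge_agent.py` and will differ), the settings you adjust conceptually look like this:

```python
# Hypothetical names for illustration only; see evaluation/config.py and
# evaluation/lmjudge_agent.py for the actual options.
MODEL_PATH = "/path/to/a2search-7b-checkpoint"            # model checkpoint to evaluate
DATASET_PATH = "dataset/musique_random_512_dev.parquet"   # evaluation dataset
TEMPERATURE = 0.7                                         # sampling temperature
NUM_ROLLOUTS = 1                                          # sampling numbers (rollouts per question)
JUDGE_API_BASE = "https://your-llm-judge-endpoint/v1"     # API path for the LLM judge
JUDGE_API_KEY = "YOUR_API_KEY"                            # API key for the LLM judge
```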
If you find this work useful, please cite our paper:
```bibtex
@misc{zhang2025a2searchambiguityawarequestionanswering,
      title={A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning},
      author={Fengji Zhang and Xinyao Niu and Chengyang Ying and Guancheng Lin and Zhongkai Hao and Zhou Fan and Chengen Huang and Jacky Keung and Bei Chen and Junyang Lin},
      year={2025},
      eprint={2510.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.07958},
}
```

