$A^2Search$: Ambiguity-Aware Question Answering with Reinforcement Learning

📄 Paper · 🤗 Model Weights

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have greatly improved open-domain Question Answering (QA). However, existing approaches still struggle with ambiguous questions that admit multiple valid answers. Standard QA benchmarks—built under the assumption of a single gold answer—produce misleading training signals and fail to reflect this reality.


We introduce $A^2Search$, an annotation-free, end-to-end training framework that detects and handles ambiguity automatically. Our pipeline works by:

  1. Identifying ambiguous questions
  2. Collecting alternative answers through trajectory sampling and evidence verification
  3. Optimizing with RL using the $\mathrm{AnsF1}$ reward, which naturally accommodates multiple valid answers
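As a rough illustration of step 3 (a minimal sketch, not the paper's exact implementation), the $\mathrm{AnsF1}$ reward can be thought of as a set-level F1 between the model's predicted answers and the reference answers, where a prediction counts as correct if it matches a reference answer or one of its aliases:

```python
def ans_f1(predictions, references):
    """Set-level F1 between predicted answers and reference answer groups.

    Each reference is a dict like {"answer": ..., "aliases": [...]} (the
    format used in this repo's dataset files). A prediction matches a
    reference if it equals the answer or any alias after lowercasing.
    Illustrative sketch only.
    """
    def norm(s):
        return s.strip().lower()

    preds = {norm(p) for p in predictions}
    matched_refs = 0        # references covered by some prediction
    matched_preds = set()   # predictions that matched some reference
    for ref in references:
        valid = {norm(ref["answer"])} | {norm(a) for a in ref.get("aliases", [])}
        hit = preds & valid
        if hit:
            matched_refs += 1
            matched_preds |= hit
    precision = len(matched_preds) / len(preds) if preds else 0.0
    recall = matched_refs / len(references) if references else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because recall is taken over all reference answers, a single rollout that returns every valid answer scores 1.0, while a rollout that commits to only one of several valid answers is partially rewarded rather than penalized outright.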

The figure above shows an ambiguous question from MuSiQue.

  • ReSearch-32B produces different answers across rollouts—some diverging from the reference but still evidence-supported.
  • $A^2Search$ instead resolves ambiguity explicitly by retrieving multiple valid answers within a single rollout.

🔑 Key Results

The table above reports results on four multi-hop QA benchmarks under the Exact Match metric.

  • We measure $\mathrm{AnsF1}/\mathrm{Recall}@k$ with $k$ rollouts.

  • For $A^2Search$, only $@1$ is shown, since it can produce multiple answers within a single rollout.

  • For other baselines, where each rollout yields only one answer (so $\mathrm{AnsF1}@1=\mathrm{Recall}@1$), we additionally include $@3$ results to evaluate performance with more rollouts.

  • The best result in each group is highlighted in bold, and the second best is underlined.

  • On eight open-domain QA benchmarks, $A^2Search$ achieves state-of-the-art performance.

  • With only one rollout, $A^2Search$-7B reaches an average $\mathrm{AnsF1}@1$ score of 48.4% on four multi-hop benchmarks, outperforming even much larger models like ReSearch-32B (46.2%).

  • Extensive analyses confirm that $A^2Search$ not only resolves ambiguity but also generalizes across benchmarks.

👉 Embracing ambiguity is essential for building reliable next-generation QA systems. Please find more experimental results and details in our paper.
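For intuition, $\mathrm{Recall}@k$ over rollouts can be sketched as the fraction of reference answers produced by at least one of the first $k$ rollouts. This is a simplified version using exact string match; the actual metric also accounts for answer aliases:

```python
def recall_at_k(rollout_answers, references, k):
    """Fraction of reference answer strings produced by at least one of the
    first k rollouts.

    rollout_answers: list of per-rollout answer lists.
    references: list of reference answer strings.
    Simplified sketch: lowercase exact match, no alias handling.
    """
    pool = {a.strip().lower() for rollout in rollout_answers[:k] for a in rollout}
    hits = sum(1 for r in references if r.strip().lower() in pool)
    return hits / len(references) if references else 0.0
```

This makes the comparison in the table concrete: a single-answer baseline needs multiple rollouts to raise its recall, whereas a model that emits several valid answers in one rollout can already cover them at $k=1$.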

🚀 Installation

1. Setup Environment

Create and activate a Python virtual environment:

python3 -m venv venv
source venv/bin/activate

Install PyTorch and FlashAttention:

pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

2. Install Dependencies

Clone this repo and install requirements:

git clone <this-repo-url>
cd <this-repo-name>
pip install -e .

📚 Data

All training and evaluation datasets are available in the ./dataset directory:

  • Training:

    a2search_musique_2wiki_nq.parquet – constructed using our evidence-verification pipeline.

  • Development:

    musique_random_512_dev.parquet – used for hyperparameter tuning and checkpoint selection.

  • Evaluation:

    Other benchmark datasets are included for evaluation.

  • Data Structure

    All datasets are stored in Parquet format and follow the same JSON structure:

    {
      "data_source": "", // dataset name
      "question": "", // the target question
      "reward_model": {
        "ground_truth": [
          { "aliases": [], "answer": "" } // reference answers and their aliases
        ]
      },
      "extra_info": {
        "id": "" // question id
      }
    }

For retrieval, we reuse the wiki index and retriever from Search-R1. Make sure you have it running locally at:

http://127.0.0.1:80
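A minimal way to verify the retriever is reachable before launching training (this only checks that the port accepts connections; the actual API endpoints are defined by Search-R1):

```python
import socket

def retriever_is_up(host="127.0.0.1", port=80, timeout=2.0):
    """Return True if a TCP connection to the retrieval service succeeds.

    Only a reachability probe; it does not validate the Search-R1 API itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. retriever_is_up() before starting a training run
```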

🏋️ RL Training

We train using the Verl framework. Training code is under the verl/ directory.

Example scripts:

  • Single-node debugging: scripts/train_debug.sh
  • Multi-node training: scripts/train_multinode.sh

⚠️ These scripts provide basic startup commands. For detailed hyperparameter settings, please refer to our paper.


📊 Evaluation

We provide an evaluation script for running experiments across benchmarks:

cd evaluation
bash run_evaluation.sh

You can configure:

  • Model checkpoint
  • Dataset
  • Temperature
  • Number of sampled rollouts

Before running evaluation, please check:

  • evaluation/config.py → correct model path
  • evaluation/lmjudge_agent.py → correct API path and key for the LLM judge module

📝 Citation

If you find this work useful, please cite our paper:

@misc{zhang2025a2searchambiguityawarequestionanswering,
      title={A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning}, 
      author={Fengji Zhang and Xinyao Niu and Chengyang Ying and Guancheng Lin and Zhongkai Hao and Zhou Fan and Chengen Huang and Jacky Keung and Bei Chen and Junyang Lin},
      year={2025},
      eprint={2510.07958},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.07958}, 
}
