[NeurIPS AI4Tab'25] TabRAG: Tabular Document Retrieval via Structured Language Representations.

Michellecsds/TabRAG

TabRAG: Tabular Document Retrieval via Structured Language Representations


¹Imperial College London, ²Columbia University

TabRAG on arXiv MIT License

Installation

To install, clone this repository and navigate to it in your terminal, then create an environment with your preferred package manager.

Note: you can use uv instead of conda.

conda create --name tabrag python=3.10
conda activate tabrag

Installing Dependencies

pip install torch
pip install 'git+https://github.com/facebookresearch/detectron2.git' --no-build-isolation
pip install pymupdf
pip install transformers
pip install openai
pip install faiss-gpu
pip install timm
pip install shapely
pip install qwen_vl_utils
pip install scipy
pip install sentence-transformers
pip install gdown
pip install opencv-python
pip install numpy==1.26.4
pip install pypdf
pip install vllm
pip install arxiv
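
After installing, it can help to confirm that every package resolves in the active environment before running the pipeline. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module list is an assumption derived from the pip commands above (several import names differ from the package names, for example pymupdf imports as `fitz`, opencv-python as `cv2`, and faiss-gpu as `faiss`).

```python
from importlib.util import find_spec

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = [
    "torch", "detectron2", "fitz", "transformers", "openai", "faiss",
    "timm", "shapely", "qwen_vl_utils", "scipy", "sentence_transformers",
    "gdown", "cv2", "numpy", "pypdf", "vllm", "arxiv",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            # find_spec locates a top-level module without importing it.
            if find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # Treat a module with a broken or absent spec as missing.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be found; it does not verify versions or GPU support.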

Layout model checkpoint

Microsoft's DiT (Document Image Transformer) model is used for layout extraction: https://github.com/microsoft/unilm/tree/master/dit

Download this checkpoint: https://mail2sysueducn-my.sharepoint.com/:u:/g/personal/huangyp28_mail2_sysu_edu_cn/ESKnk2I_O09Em52V1xb2Ux0BrO_Z-7cuzL3H1KQRxipb7Q?e=iqTfGc

Move the downloaded checkpoint into the project directory.

Datasets

Create a datasets/ folder and move into it:

mkdir datasets
cd datasets

TAT-DQA:

Download the TAT-DQA dataset from Google Drive.

Make a tatdqa/ folder and download the following into it:

Dataset: gdown https://drive.google.com/uc?id=1iqe5r-qgQZLhGtM4G6LkNp9S6OCwOF2L (unzip this after downloading)

QA pairs: gdown https://drive.google.com/uc?id=1ZQjjIC0BB14l6t9b1Ryq0t-CNAP6iC2J

Make sure the dataset is in datasets/tatdqa/test and the QA pairs are in datasets/tatdqa/.
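
The expected TAT-DQA layout can be checked with a few lines of Python. This is a minimal sketch under stated assumptions, not code from the repository: the function name `check_tatdqa_layout` is hypothetical, and it assumes the QA pairs are stored as a .json file placed directly in datasets/tatdqa/. Run it from the project directory, or pass the path explicitly.

```python
from pathlib import Path


def check_tatdqa_layout(root="datasets/tatdqa"):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []

    # The unzipped dataset is expected under <root>/test.
    test_dir = root / "test"
    if not test_dir.is_dir():
        problems.append(f"missing directory: {test_dir}")
    elif not any(test_dir.iterdir()):
        problems.append(f"empty directory: {test_dir}")

    # Assumption: the QA pairs are a .json file directly inside <root>.
    if not any(root.glob("*.json")):
        problems.append(f"no .json file found in {root}")

    return problems


if __name__ == "__main__":
    for problem in check_tatdqa_layout():
        print(problem)
```

An empty result only confirms that the folders and a JSON file are present; it does not validate their contents.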

MP-DocVQA:

wget https://datasets.cvc.uab.es/rrc/DocVQA/Task4/images.tar.gz --no-check-certificate
tar -xvf images.tar.gz
python process_mpdocvqa.py # keep only the documents that contain tables
python filter_mpdocvqa.py # select 500 pages based on the QA-to-pages ratio
python indent_mpdocvqa.py # indent val.json for readability

SPIQA:

# create datasets/SPIQA and cd into it
pip install arxiv

# then, in a Python shell (run: python)
from huggingface_hub import snapshot_download
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir=".")  # local_dir is the download target; "." is the current directory

FinTabNet:

wget https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/fintabnet.tar.gz
tar -xvf fintabnet.tar.gz

Run

python make_ragstore.py
