TabRAG: Tabular Document Retrieval via Structured Language Representations

Jacob Si¹* | Mike Qu²* | Michelle Lee¹ | Yingzhen Li¹

¹Imperial College London ²Columbia University

Installation

Clone this repository and navigate to it in your terminal. Then create an environment using your preferred package manager.

Note: uv can be used in place of conda.

conda create --name tabrag python=3.10
conda activate tabrag

Installing Dependencies

pip install torch
pip install 'git+https://github.com/facebookresearch/detectron2.git' --no-build-isolation
pip install pymupdf
pip install transformers
pip install openai
pip install faiss-gpu
pip install timm
pip install shapely
pip install qwen_vl_utils
pip install scipy
pip install sentence-transformers
pip install gdown
pip install opencv-python
pip install numpy==1.26.4
pip install pypdf
pip install vllm
pip install arxiv
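
After installing, it can help to confirm that every package is visible to the interpreter. The sketch below is not part of the repository; the helper name missing_modules and the list REQUIRED_MODULES are hypothetical. It only checks that each module can be located, not that it works (for example, it does not verify GPU support).

```python
from importlib.util import find_spec

# Import names for the packages installed above. Several differ from their
# pip names: pymupdf -> fitz, faiss-gpu -> faiss, opencv-python -> cv2,
# sentence-transformers -> sentence_transformers.
REQUIRED_MODULES = [
    "torch", "detectron2", "fitz", "transformers", "openai", "faiss",
    "timm", "shapely", "qwen_vl_utils", "scipy", "sentence_transformers",
    "gdown", "cv2", "numpy", "pypdf", "vllm", "arxiv",
]


def missing_modules(names):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name it prints corresponds to a pip install line above that should be rerun inside the tabrag environment.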

Layout model checkpoint

Microsoft's DiT (Document Image Transformer) model is used for layout extraction: https://github.com/microsoft/unilm/tree/master/dit

Download this checkpoint: https://mail2sysueducn-my.sharepoint.com/:u:/g/personal/huangyp28_mail2_sysu_edu_cn/ESKnk2I_O09Em52V1xb2Ux0BrO_Z-7cuzL3H1KQRxipb7Q?e=iqTfGc

Move the downloaded checkpoint to the project directory.

Datasets

Create a datasets/ folder and move into it:

mkdir datasets
cd datasets

TAT-DQA:

The TAT-DQA dataset is hosted on Google Drive.

Make a tatdqa/ folder and download the following into it:

Dataset: gdown https://drive.google.com/uc?id=1iqe5r-qgQZLhGtM4G6LkNp9S6OCwOF2L (unzip this after downloading)

QA Answer Pairs: gdown https://drive.google.com/uc?id=1ZQjjIC0BB14l6t9b1Ryq0t-CNAP6iC2J

Make sure the dataset is in datasets/tatdqa/test and the QA answer pairs are in datasets/tatdqa/.
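
The layout can be checked with a few lines of Python before running anything downstream. This is a minimal sketch, not part of the repository: the helper missing_paths and the list TATDQA_DIRS are hypothetical, and the script assumes it is run from the project directory (the parent of datasets/).

```python
from pathlib import Path

# Directories named in the TAT-DQA instructions, relative to datasets/.
TATDQA_DIRS = ["tatdqa", "tatdqa/test"]


def missing_paths(root, relative_paths):
    """Return the entries of `relative_paths` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("datasets", TATDQA_DIRS)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The same helper can be reused for the other datasets by passing a different list of relative paths.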

MP-DocVQA:

wget https://datasets.cvc.uab.es/rrc/DocVQA/Task4/images.tar.gz --no-check-certificate
tar -xvf images.tar.gz
python process_mpdocvqa.py # keep documents that contain tables
python filter_mpdocvqa.py # select 500 pages based on the QA-to-pages ratio
python indent_mpdocvqa.py # indent val.json for readability

SPIQA:

# create datasets/SPIQA and cd into it
pip install arxiv

# then, in a Python shell (start one with: python)
from huggingface_hub import snapshot_download
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')  # local_dir is the download destination

FinTabNet:

wget https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/fintabnet.tar.gz
tar -xvf fintabnet.tar.gz

Run

python make_ragstore.py
