Accepted Paper in ACS Catalysis
DOI link: https://doi.org/10.1021/acscatal.5c06431 (Available after proof)
- Models supported:
  - openai_*, google_*, deepseek_*, openrouter_* (set the matching API key)
- Primary scripts:
  - python -m CATDA.extract_main: extract CatGraph and/or generate the ML dataset
  - python -m CATDA.tools.neo4j.neo4j_import: import CatGraph JSON into Neo4j
  - python -m CATDA.launch_gradio: launch the CatAgent UI for querying
Set these before running commands (Windows PowerShell examples):
- Model provider API key (depending on the --model you choose): OPENAI_API_KEY, GOOGLE_API_KEY, DEEPSEEK_API_KEY, or OPENROUTER_API_KEY
- Neo4j connection:
  - NEO4J_URI (default neo4j://localhost:7687)
  - NEO4J_USER (default neo4j)
  - NEO4J_PASSWORD (required)
- Optional regex mapping files for resolvers:
  - NAME_RESOLVER_REGEX_MAP (path to a JSON file)
  - FIELD_RESOLVER_REGEX_MAP (path to a JSON file)
Examples:
$env:GOOGLE_API_KEY = "<your_key>"
$env:NEO4J_PASSWORD = "<neo4j_password>"
# Optional
$env:NAME_RESOLVER_REGEX_MAP = "D:\configs\name_regex.json"
$env:FIELD_RESOLVER_REGEX_MAP = "D:\configs\field_regex.json"
OpenRouter note:
- Use model names like openrouter_openai/gpt-4o-mini with OPENROUTER_API_KEY.
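On Linux/macOS, the same variables can be exported in bash (a direct translation of the PowerShell examples above; the paths are illustrative):

```shell
# bash equivalent of the PowerShell setup above (paths are illustrative)
export GOOGLE_API_KEY="<your_key>"
export NEO4J_PASSWORD="<neo4j_password>"
# Optional resolver maps
export NAME_RESOLVER_REGEX_MAP="$HOME/configs/name_regex.json"
export FIELD_RESOLVER_REGEX_MAP="$HOME/configs/field_regex.json"
```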
Note: the PaddleOCR client may conflict with the current conversation agent's similar-keyword fetch. Please test, and consider creating a separate environment for preprocessing only.
Use the client script pdf_preprocess/PaddleOCR_vl_pdf2md_client.py for PDF → Markdown preprocessing.
This repository only provides the client. Deploy a PaddleOCR‑VL server separately using the official guide:
paddleocr.ai/latest/en/version3.x/pipeline_usage/PaddleOCR-VL.html
Example (batch directory):
python pdf_preprocess/PaddleOCR_vl_pdf2md_client.py <input_pdf_or_dir> \
-o <output_dir> \
--vllm-url http://127.0.0.1:8118/v1 \
--pdf-batch-size 4 \
--vl-rec-max-concurrency 128 \
--clean-html-tables
Notes:
- If you wish PaddleOCR to read chart figures for you, add --expand-charts
- Output is written as <output_dir>/<relative_path>/<stem>/<stem>.md with an img/ folder.
- Clean HTML tables are good enough for LLM extraction in recent tests (see: https://www.improvingagents.com/blog/best-input-data-format-for-llms).
- Downstream CATDA extraction expects files under one root directory; pass --file-ext .txt if your outputs are .txt (default is .md).
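A quick way to sanity-check the preprocessing output is to walk the output directory for files following the `<stem>/<stem>.md` layout described above (a sketch; `list_markdown_outputs` is a hypothetical helper, and "out_md" is an illustrative path):

```python
# Sketch: list preprocessed Markdown files that follow the
# <output_dir>/<relative_path>/<stem>/<stem>.md convention described above.
from pathlib import Path

def list_markdown_outputs(output_dir):
    """Return .md files whose name matches their parent folder name."""
    root = Path(output_dir)
    return sorted(
        p for p in root.rglob("*.md")
        if p.stem == p.parent.name  # enforces the <stem>/<stem>.md layout
    )

if __name__ == "__main__":
    for md in list_markdown_outputs("out_md"):  # illustrative output dir
        print(md)
```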
Command (run from the project root):
export PYTHONPATH=.
python -m CATDA.extract_main <input_path> \
--output-dir <out_dir> \
--file-ext .md \
--mode both \
--processes 4 \
--feature-file CATDA/prompts/features_to_extract.txt
Key flags:
- input_path: file or directory of source texts
- --output-dir: where results are written (creates graph/, dataset/, metadata/)
- --file-ext: input extension filter (default .md)
- --mode: extract | generate-ml-only | both (default both)
- --graph-pattern: pattern for existing graphs when using generate-ml-only (default *_output.json under <out_dir>/graph)
- --feature-file: feature definitions for the ML dataset (default CATDA/prompts/features_to_extract.txt)
Outputs:
- CatGraph JSON: <out_dir>/graph/<run_id>_output.json
- ML dataset (TSV): <out_dir>/dataset/<run_id>_dataset.tsv
- Aux logs/metadata under <out_dir>/metadata/ and temp artifacts
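The dataset TSV can be inspected with the stdlib csv module (a sketch; the column names depend on the feature file you used, and the path below is illustrative):

```python
# Sketch: load a generated dataset TSV (columns depend on your feature file).
import csv
import os

def read_dataset(tsv_path):
    """Return (header, rows) from a tab-separated dataset file."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    return rows[0], rows[1:]

if __name__ == "__main__":
    path = "out/dataset/run1_dataset.tsv"  # illustrative run id
    if os.path.exists(path):
        header, rows = read_dataset(path)
        print("columns:", header)
        print("rows:", len(rows))
```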
Start Neo4j, then run:
export PYTHONPATH=.
python -m CATDA.tools.neo4j.neo4j_import <input_json_or_dir> \
--neo4j_uri neo4j://localhost:7687 \
--neo4j_user neo4j \
--neo4j_password "$NEO4J_PASSWORD" \
--clear
Notes:
- --clear wipes the DB before the first import; omit it for incremental loads
- For a single file, --paper_name overrides the Paper node name (otherwise it is derived from the filename)
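After an import, a quick node count confirms the database is populated (a sketch using the official neo4j Python driver; it assumes the importer creates Paper nodes — adjust the label to your graph schema):

```python
# Sketch: count Paper nodes after import (assumes a `Paper` label exists;
# adjust to your actual schema). Requires: pip install neo4j
import os

COUNT_QUERY = "MATCH (p:Paper) RETURN count(p) AS papers"

def count_papers():
    """Return the number of Paper nodes, or None if Neo4j is not configured."""
    if "NEO4J_PASSWORD" not in os.environ:
        print("NEO4J_PASSWORD not set; skipping")
        return None
    from neo4j import GraphDatabase
    uri = os.environ.get("NEO4J_URI", "neo4j://localhost:7687")
    user = os.environ.get("NEO4J_USER", "neo4j")
    with GraphDatabase.driver(uri, auth=(user, os.environ["NEO4J_PASSWORD"])) as driver:
        return driver.execute_query(COUNT_QUERY).records[0]["papers"]
```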
export PYTHONPATH=.
python -m CATDA.launch_gradio \
--model google_gemini-2.5-pro \
--neo4j-password "$NEO4J_PASSWORD" \
--gradio-port 6810 \
--listen-all
Optional overrides:
- --name-regex-map, --field-regex-map to supply resolver JSONs (see examples below)
- If provided, exact/regex matches take precedence over vector search.
Object-map format:
{
"mfi|mobil five|mordenite": "MFI",
"silic.*": "silica"
}
List format:
[
{"pattern": "conv(ersion)?\\s*rate", "name": "result_conversion_pct"},
{"pattern": "yield", "key": "result_yield_pct"}
]
- Edit the feature spec in CATDA/prompts/features_to_extract.txt (PX-ISOMERIZATION example) or CATDA/prompts/features_to_extract_OCM.txt (OCM example). The expected format is a simple |-separated description list with two sections: "Catalyst properties" and "Testing condition/metrics". Because the spec is passed to the LLM as a plain text block, the format is flexible; the LLM handles it correctly.
- Point the extractor at a custom file with --feature-file <path> when running CATDA.extract_main.
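The object-map format above could be applied with logic along these lines (an illustrative sketch only — the actual resolver implementation may differ):

```python
# Sketch: exact/regex resolution with a vector-search fallback, mirroring the
# precedence described above. Illustrative only, not the actual resolver.
import re

def resolve(term, regex_map, vector_search):
    """Return a canonical name via the regex map, else fall back to vector search."""
    t = term.strip().lower()
    for pattern, canonical in regex_map.items():
        # keys may contain alternatives separated by "|", as in the example above
        if re.fullmatch(pattern, t):
            return canonical
    return vector_search(term)  # no exact/regex match: fall back

regex_map = {"mfi|mobil five|mordenite": "MFI", "silic.*": "silica"}
print(resolve("silicalite", regex_map, lambda q: "<vector hit>"))  # → silica
print(resolve("ZSM-5", regex_map, lambda q: "<vector hit>"))
```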
Typical call with a custom feature file:
export PYTHONPATH=.
python -m CATDA.extract_main <input_path> \
--output-dir out \
--mode generate-ml-only \
--feature-file D:/configs/my_features.txt
Adjust the extraction prompts used to build CatGraph:
- File: CATDA/prompts/extract_prompts.py
- Core synthesis prompt: variable synthesis_graph_prompt
- Post-check for synthesis: synthesis_missing_check_prompt
- The same file also contains prompts for the characterization/testing phases
Guidance:
- Preserve the overall output schema keys expected by the importer and downstream tools (nodes, edges, field names like synthesis_input, synthesis_output, etc.)
- Tighten or relax rules in the prompt sections (e.g., condition capture, property handling) to suit your domain
- Keep JSON-only answers enforced in the prompt to simplify parsing
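When iterating on prompts, a minimal structural check on the emitted JSON catches schema drift early (a sketch that only verifies the top-level nodes/edges keys mentioned above; extend it to any field names your importer relies on):

```python
# Sketch: minimal structural check for an extracted CatGraph JSON file.
# Only verifies the top-level keys named above (nodes, edges).
import json

def check_catgraph(path):
    """Raise ValueError if required top-level keys are missing; else return counts."""
    with open(path, encoding="utf-8") as f:
        graph = json.load(f)
    missing = [k for k in ("nodes", "edges") if k not in graph]
    if missing:
        raise ValueError(f"{path}: missing keys {missing}")
    return len(graph["nodes"]), len(graph["edges"])
```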
- Extract CatGraph only
export PYTHONPATH=.; python -m CATDA.extract_main data/OCM_articles_MD --output-dir out_v1 --file-ext .md --mode extract --processes 4
- Generate ML dataset from existing CatGraph
export PYTHONPATH=.; python -m CATDA.extract_main out_v1 --output-dir out_v1 --mode generate-ml-only --graph-pattern "*_output.json" --feature-file CATDA/prompts/features_to_extract.txt
- Import to Neo4j
export PYTHONPATH=.; python -m CATDA.tools.neo4j.neo4j_import out_v1/graph --neo4j_user neo4j --neo4j_password "$NEO4J_PASSWORD" --clear
- Launch CatAgent
export PYTHONPATH=.; python -m CATDA.launch_gradio --model google_gemini-2.5-pro --neo4j-password "$NEO4J_PASSWORD" --gradio-port 6810
- Clone
git clone <repository-url>
cd <repository-root>
- Environment
- Conda (recommended):
conda create -n catda python=3.10 -y
conda activate catda
pip install -r CATDA/requirements.txt
- venv (alternative):
python -m venv venv
venv\Scripts\activate # Windows
# or: source venv/bin/activate
pip install -r CATDA/requirements.txt
- LLM-powered extraction and reasoning
- CatGraph representation for catalyst synthesis/testing
- Full-document parsing (beyond abstracts)
- CatAgent for interactive, grounded queries over Neo4j graphs
- ML-ready dataset generation from CatGraph
CATDA/
├── .cache/ # Cache directory
├── agentic_tools/ # Tools for the agentic components
├── docs/ # Documentation
├── examples/ # Example files and use cases
├── models/ # Model wrappers and utilities
├── prompts/ # LLM prompts and feature specs
├── service/ # Service-related code
├── tools/ # Utilities (incl. neo4j importer)
├── ui/ # Gradio UI
├── extract_main.py # Extraction + dataset generation entrypoint
├── launch_gradio.py # Launch the Gradio UI
├── requirements.txt # Python dependencies
└── README.md
Contributions are welcome! Please open issues/PRs for improvements.
We are currently working on a multimodal version of CATDA, CATDA-MM. If you are interested in the project and would like to collaborate, you may contact the corresponding author of the paper or me.