A large suite of tools and scripts for downloading and processing documents for the ARAIA project.
The primary utility is the climpdf command-line tool, with the following commands:
Usage: climpdf [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
convert Convert PDFs in a given directory ``source`` to json.
count-local Count the number of downloaded files from a given source. Creates a checkpoint file.
count-remote-osti Count potentially downloadable files from OSTI, for any number of search terms. Leave blank for all.
crawl-epa Asynchronously crawl EPA result pages.
crawl-osti Asynchronously crawl OSTI result pages.
complete-semantic-scholar Download documents from Semantic Scholar that match a given input file containing document IDs.
epa-ocr-to-json Convert EPA's OCR fulltext to a json format similar to the internal schema.
section-dataset Preprocess full-text files in s2orc/pes2o format into headers and subsections.
These will be described in more detail below.
The scripts directory contains additional tools for associating metadata with documents and for updating checkpoint files.
git clone https://github.com/project-araia/climpdfgetter.git
cd climpdfgetter
Then either:
Recommended: Use Pixi to take advantage of the included, guaranteed-working environment:
curl -fsSL https://pixi.sh/install.sh | sh
pixi shell -e climpdf
Or:
pip install -e .
Note that dependency resolution issues are much less likely with Pixi.
Documents for multiple provided search terms are collected in parallel.
Usage: climpdf crawl-osti [OPTIONS] START_YEAR STOP_YEAR
Specify the start year and stop year range for document publishing, then
any number of -t <term>. For instance:
climpdf crawl-osti 2010 2025 -t Blizzard -t Tornado -t "Heat Waves"
Notes:
- OSTI limits search results to 1000 for each term. Use
  climpdf count-remote-osti [OPTIONS] START_YEAR STOP_YEAR
  to help adjust year ranges (see the sketch after these notes).
- Corresponding metadata is also downloaded.
- Run
  climpdf count-local OSTI
  between searches to determine the number of documents downloaded from OSTI, and update the local checkpoint file. The checkpoint prevents downloading duplicates.
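Since OSTI caps results per term, one workaround is to crawl in smaller year windows and refresh the checkpoint between runs. A minimal sketch of such a loop, using only the commands documented above; the window size and search terms are illustrative, not defaults of the tool:

import subprocess

# Crawl OSTI in smaller year windows so each term stays under the 1000-result cap.
# Window size and search terms below are examples, not part of climpdf itself.
terms = ["Blizzard", "Tornado", "Heat Waves"]
window = 3  # years per crawl; shrink it if count-remote-osti reports counts near 1000

for start in range(2010, 2025, window):
    stop = min(start + window, 2025)
    cmd = ["climpdf", "crawl-osti", str(start), str(stop)]
    for term in terms:
        cmd += ["-t", term]
    subprocess.run(cmd, check=True)
    # Refresh the local checkpoint between searches so duplicates are skipped.
    subprocess.run(["climpdf", "count-local", "OSTI"], check=True)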
Usage: climpdf crawl-epa [OPTIONS] STOP_IDX START_IDX
Specify the stop index and start index within the search results, then any
number of -t <term>. For instance:
climpdf crawl-epa 100 0 -t Flooding
Usage: climpdf count-local [OPTIONS] SOURCE
Specify a source to count the number of downloaded files. Directories prefixed with SOURCE are
assumed to contain downloaded files corresponding to that source.
Also creates a SOURCE_docs_ids.json in the data directory.
This file is treated as a checkpoint file, and is referenced by climpdf crawl-osti and climpdf crawl-epa.
For instance:
$ climpdf count-local EPA
2342
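If needed, the checkpoint can also be inspected directly. A minimal sketch, assuming SOURCE_docs_ids.json is a JSON collection of already-downloaded document IDs stored under data/ (the exact layout may differ):

import json
from pathlib import Path

# Assumption: the checkpoint holds a JSON collection of downloaded document IDs.
checkpoint = Path("data/EPA_docs_ids.json")
doc_ids = json.loads(checkpoint.read_text())
print(f"{len(doc_ids)} EPA documents recorded in the checkpoint")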
Usage: climpdf convert [OPTIONS] SOURCE
Convert PDFs in a given directory ``source`` to json. If the input files are
of a different format, they'll first be converted to PDF.
Options:
-i, --images-tables
-o, --output-dir TEXT
-g, --grobid_service TEXT
Collects downloaded files in a given directory and:
1. Convert non-PDF documents to PDF if eligible (png, tiff, etc.).
2. Extract text using Grobid (recommended) or Open Parse.
3. [In active development] Extract images and tables from text using Layout Parser.
4. Dump text to <output_dir>.json.
5. If 3. is enabled, save tables and images to a per-document directory.
For instance:
climpdf convert data/EPA_2024-12-18_15:09:27
or
climpdf convert data/EPA_2024-12-18_15:09:27 --grobid_service http://localhost:8080
Eligible documents are collected from subdirectories.
Problematic documents are noted as such for future conversion attempts.
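Conversion can also be scripted across several crawl directories at once. A minimal sketch, assuming crawl output sits under data/ in SOURCE-prefixed directories and a Grobid service is reachable at the example URL:

import subprocess
from pathlib import Path

# Convert every OSTI crawl directory under data/; the Grobid URL is an example.
for source_dir in sorted(Path("data").glob("OSTI_*")):
    subprocess.run(
        ["climpdf", "convert", str(source_dir),
         "--grobid_service", "http://localhost:8080"],
        check=True,
    )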
Usage: climpdf section-dataset [OPTIONS] SOURCE
Preprocess full-text files in s2orc/grobid format into headers and subsections.
This utility scans a directory for files presumed to be in the s2orc format,
i.e. files that have been processed by Grobid, as the original authors of s2orc
did and as climpdf convert --grobid_service does.
For instance:
climpdf section-dataset data/OSTI_2024-12-18_15:09:27
The parsed documents are scanned for titles, headers, and associated subsections. A heuristic rejects headers and subsections that are too short, too long, not in English, and/or contain too many special characters.
Additional subsections and headers are rejected if they likely don't correspond to natural-language text. For instance, ASCII representations of tables are rejected.
Fields without referenceable content like "author affiliations", "disclaimer", "acknowledgements", and
others are rejected.
The resulting output is a dictionary containing relevant headers as keys and subsections as values.
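The exact heuristics live in the tool itself; the sketch below only illustrates the kind of checks described above, with made-up thresholds and without the language check:

# Illustrative only: thresholds and rejected-field names are examples,
# not climpdf's actual values.
REJECTED_HEADERS = {"author affiliations", "disclaimer", "acknowledgements"}

def keep_section(header: str, body: str) -> bool:
    if header.strip().lower() in REJECTED_HEADERS:
        return False
    if not (10 <= len(body) <= 20_000):  # too short or too long (example bounds)
        return False
    specials = sum(not (c.isalnum() or c.isspace()) for c in body)
    if specials > 0.2 * len(body):  # likely a table or other non-prose content
        return False
    return True

candidate_sections = [
    ("Introduction", "Severe storms have become more frequent in recent decades."),
    ("Acknowledgements", "The authors thank the funding agencies."),
]
sections = {h: b for h, b in candidate_sections if keep_section(h, b)}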
Coming soon.
from pydantic import BaseModel

class ParsedDocumentSchema(BaseModel):
    source: str = ""
    title: str = ""
    text: list[str] = []
    abstract: str = ""
    authors: list[str] = []
    publisher: str = ""
    date: int | str = 0
    unique_id: str = ""
    doi: str = ""
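A minimal sketch of validating one converted document against this schema, assuming pydantic v2 and that each converted file holds a single document; the path is illustrative:

import json

# ParsedDocumentSchema is the class shown above (importable from the package).
# The path below is illustrative; converted output ends up under <output_dir>.
with open("converted_document.json") as f:
    doc = ParsedDocumentSchema.model_validate(json.load(f))
print(doc.title, doc.unique_id)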
Development and package management is done with Pixi.
Enter the development environment with:
pixi shell -e dev
climpdf uses: