A set of tools to scrape, inventory, and analyze files related to the Jeffrey Epstein case released by the Department of Justice.
The project includes a robust scraping script, `scrape_epstein.py`, designed to fetch all documents and media files from https://www.justice.gov/epstein.
- **Comprehensive Crawl:** Recursively finds files in subsections like Court Records and FOIA (FBI, BOP).
- **Bot Protection Bypass:** Uses `playwright-stealth` and user-like behavior to navigate Akamai protections.
- **Resumable:** Maintains a local `epstein_files/inventory.json` database. If the script is interrupted, simply run it again to pick up exactly where it left off.
- **Media Support:** Downloads PDFs and ZIPs, as well as media files like `.wav`, `.mp3`, and `.mp4`.
- **Collision Handling:** Automatically renames duplicate filenames (e.g. `file_1.pdf`) so no data is overwritten or lost.
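The collision-handling rule above can be sketched as a pure function. `unique_name` is an illustrative helper, not necessarily how `scrape_epstein.py` implements it:

```python
def unique_name(existing: set, filename: str) -> str:
    """Pick a non-colliding filename: file.pdf -> file_1.pdf -> file_2.pdf ...

    `existing` is the set of filenames already present in the target directory.
    """
    if filename not in existing:
        return filename
    stem, dot, suffix = filename.rpartition(".")
    n = 1
    while True:
        # Insert a numeric suffix before the extension (or append if none).
        candidate = f"{stem}_{n}.{suffix}" if dot else f"{filename}_{n}"
        if candidate not in existing:
            return candidate
        n += 1
```

Because the counter only advances until the first free name, re-running the scraper never overwrites a previously downloaded file.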
- **Install Dependencies**

  ```bash
  pip install playwright playwright-stealth pymupdf
  playwright install chromium
  ```
- **Run Scraper**

  ```bash
  python scrape_epstein.py
  ```
  The script will:
  - Create an `epstein_files/` directory.
  - Crawl the Justice.gov pages.
  - Populate `epstein_files/inventory.json`.
  - Download all new files.
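The resume behavior can be sketched as follows. The `inventory.json` schema shown here (URL keys with a `downloaded` flag) is an assumption for illustration, not the script's actual format:

```python
import json
from pathlib import Path

def load_inventory(path: Path) -> dict:
    """Load inventory.json if present, else start a fresh inventory.
    Assumed (illustrative) schema: {url: {"filename": str, "downloaded": bool}}."""
    if path.exists():
        return json.loads(path.read_text())
    return {}

def pending_urls(inventory: dict) -> list:
    """URLs recorded in the inventory but not yet downloaded; on a re-run,
    only these need fetching, which is what makes the script resumable."""
    return [url for url, entry in inventory.items() if not entry.get("downloaded")]
```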
- **Classify Files (Optional but Recommended)**

  ```bash
  python classify_files.py
  ```

  This script analyzes downloaded PDFs to determine whether they are Text (searchable) or Scanned (images). It updates `epstein_files/inventory.json` with this classification, enabling targeted OCR processing.
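A minimal version of the text-vs-scanned heuristic, classifying from per-page extracted text (with PyMuPDF that would be roughly `[page.get_text() for page in fitz.open(path)]`). The 25-character threshold is an assumption, not necessarily what `classify_files.py` uses:

```python
def classify_pdf(page_texts: list, min_chars_per_page: int = 25) -> str:
    """Classify a PDF as 'text' or 'scanned' from its per-page extracted text.

    A scanned PDF yields little or no embedded text, so a low average
    character count per page suggests the pages are images needing OCR.
    """
    if not page_texts:
        return "scanned"
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return "text" if avg >= min_chars_per_page else "scanned"
```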
- **Extract Content**

  ```bash
  python extract_content.py
  ```

  Extracts embedded images and text from the PDFs into dedicated subdirectories (e.g., `epstein_files/001/images/`).
- **Process Images**

  ```bash
  python process_images.py [--overwrite] [--just documents|extracted]
  ```

  Generates web-optimized AVIF derivatives for all images and PDFs found in the inventory.
  - **Documents (PDFs):** Generates a lightweight preview (`medium.avif` at 800px, page 1 only) and an `info.json` with metadata.
  - **Extracted Images:** Generates sized derivatives (tiny, thumb, small, medium, full).
  - **Flags:**
    - `--overwrite`: Force regeneration of existing files (useful for applying new quality settings).
    - `--just`: Limit scope to `documents` (PDFs only) or `extracted` (images only).
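The derivative sizing can be sketched like this. Only the 800px `medium` width appears above; the other pixel widths are placeholder assumptions:

```python
# Illustrative derivative widths; only medium=800 is documented above.
DERIVATIVE_WIDTHS = {"tiny": 64, "thumb": 160, "small": 400, "medium": 800, "full": None}

def target_size(orig_w: int, orig_h: int, width) -> tuple:
    """Scale to the derivative width, preserving aspect ratio; never upscale.
    A `width` of None means keep the original resolution (the 'full' tier)."""
    if width is None or orig_w <= width:
        return orig_w, orig_h
    return width, round(orig_h * width / orig_w)
```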
- **Extract Metadata**

  ```bash
  python extract_metadata.py
  ```

  Extracts embedded EXIF and XMP metadata from all images and PDFs in the inventory.
  - **Output:** Creates a `meta.json` file in the image's or document's directory containing the raw metadata.
  - **PDF Support:** Extracts XMP, standard document info, layers (OCGs), fonts, embedded files, and annotation summaries.
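A sketch of assembling the `meta.json` payload; the field layout here is an assumption for illustration, not the script's actual schema:

```python
import json

def build_meta(exif: dict, xmp: dict) -> str:
    """Serialize raw metadata sections into a meta.json payload.
    `default=str` guards against non-JSON-serializable EXIF values
    (e.g. rational numbers or byte strings from camera tags)."""
    payload = {"exif": exif, "xmp": xmp, "has_metadata": bool(exif or xmp)}
    return json.dumps(payload, indent=2, default=str)
```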
- **Image Analysis**

  ```bash
  python analyze_images.py [--overwrite]
  ```

  Uses a local LLM to analyze extracted images and generate structured JSON descriptions (`type`, `objects`, `ocr_needed`, etc.).

  Requirements:
  - A vision-capable model loaded (e.g., `mistralai/ministral-3-3b` or `llava`).
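Local models sometimes emit malformed JSON, so it is worth validating a response before writing `analysis.json`. A sketch, using the field names listed above (the exact required-key set is an assumption):

```python
import json

REQUIRED_KEYS = {"type", "objects", "ocr_needed"}  # subset of the fields named above

def parse_analysis(raw: str):
    """Parse a model response into a structured analysis record.
    Returns None if the text is not valid JSON, is not an object,
    or is missing any required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    return data
```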
- **Perform OCR**

  ```bash
  python perform_ocr.py [--dry-run]
  ```

  Walks through the `epstein_files` directory and performs OCR on images flagged with `"needs_ocr": true` in their `analysis.json` file.

  Features:
  - **Smart Selection:** Prioritizes original high-quality images (`.png`/`.jpg`) over compressed `.avif` if available.
  - **Auto-Resize:** Automatically resizes images larger than 2048px to prevent API errors.
  - **Resumable:** Skips directories where `ocr.txt` already exists.
  - **Dry Run:** Use `--dry-run` to see which files would be processed without making API calls.

  Requirements:
  - LM Studio running on `http://localhost:1234` (or a configured URL).
  - An OCR-capable model loaded (recommended: `allenai/olmocr-2-7b`).
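The selection and resize rules can be sketched as pure functions; the helper names are illustrative, not taken from `perform_ocr.py`:

```python
def pick_ocr_source(filenames: list):
    """Choose the best OCR source among one image directory's files:
    skip the directory entirely if ocr.txt exists (resumable); otherwise
    prefer original .png/.jpg over compressed .avif."""
    if "ocr.txt" in filenames:
        return None
    for ext in (".png", ".jpg", ".jpeg", ".avif"):
        matches = sorted(f for f in filenames if f.lower().endswith(ext))
        if matches:
            return matches[0]
    return None

def clamp_size(w: int, h: int, limit: int = 2048) -> tuple:
    """Auto-resize: scale down so the longest side is at most `limit` px."""
    longest = max(w, h)
    if longest <= limit:
        return w, h
    scale = limit / longest
    return round(w * scale), round(h * scale)
```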
- **Perform PDF OCR**

  ```bash
  python perform_pdf_ocr.py [--dry-run] [--overwrite]
  ```

  Performs page-by-page OCR on the full PDF documents using LM Studio. This is useful for documents that are scanned images without embedded text.
  - **Features:**
    - Renders each page to a high-quality PNG (1288px max dimension).
    - Sends each page plus an expert prompt to LM Studio.
    - Aggregates pages into a single `ocr.md` markdown file.
  - **Requirements:** Same as Image OCR (LM Studio + vision model).
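Two pieces of that pipeline sketched in isolation: computing the render zoom for the 1288px cap (PDF pages are sized in points, rendered at 72 dpi per unit zoom), and joining per-page results into `ocr.md`. The page-header format is an assumption:

```python
def page_zoom(width_pt: float, height_pt: float, max_px: int = 1288) -> float:
    """Zoom factor so the page's longest side renders at max_px pixels,
    e.g. for PyMuPDF's page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))."""
    return max_px / max(width_pt, height_pt)

def aggregate_pages(page_texts: list) -> str:
    """Join per-page OCR results into a single markdown document."""
    parts = [f"## Page {i}\n\n{text.strip()}" for i, text in enumerate(page_texts, 1)]
    return "\n\n".join(parts) + "\n"
```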
- **Transcribe Media**

  ```bash
  python transcribe_media.py [--model large-v2] [--device cpu|cuda]
  ```

  Transcribes audio/video files (mp3, wav, mp4, etc.) found in the inventory using WhisperX. It generates a `.vtt` subtitle file next to each media file.

  Requirements:
  - FFmpeg must be installed and on your system PATH.
  - WhisperX: `pip install git+https://github.com/m-bain/whisperX.git`
  - HuggingFace Token (optional): Set `HF_TOKEN` in `.env` for speaker diarization (requires accepting the pyannote terms).
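The `.vtt` output format is simple enough to sketch: WebVTT cues pair `HH:MM:SS.mmm` timestamps with text. These helpers are illustrative, not from `transcribe_media.py`:

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def vtt_cue(start: float, end: float, text: str) -> str:
    """One WebVTT cue block: a timing line followed by the spoken text."""
    return f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}\n"
```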
- **Detect Faces**

  ```bash
  python detect_faces.py [--overwrite]
  ```

  Scans all images in the inventory for faces using `insightface`.
  - **Features:**
    - Detects bounding boxes and landmarks, and extracts embeddings for facial recognition/clustering.
    - Saves results to `faces.json` in the image's directory.
    - Ignores the `has_faces` flag from analysis (processes everything) for maximum coverage.
  - **Requirements:** `insightface` and `onnxruntime` installed (included in `requirements.txt`).
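Clustering or matching those embeddings typically comes down to cosine similarity between vectors; a dependency-free sketch:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two face embeddings (e.g. the vectors
    insightface produces); values near 1.0 suggest the same person,
    values near 0.0 suggest unrelated faces."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice the match threshold is model-dependent and tuned empirically.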
- **Ingest to Firebase**

  ```bash
  python ingest_to_firebase.py [--only documents|images|faces] [--force]
  ```

  Populates a Firestore database with the processed data.
  - **Documents:** Uploads PDF previews and metadata to the `documents` collection.
  - **Images:** Uploads extracted photo previews and metadata to the `images` collection.
  - **Faces:** NEW! Ingests detected faces and vector embeddings to the `faces` collection.
    - **Vector Search:** Uses Firestore Vector Search. You must create the index first (note: the dimension is 512 for the default `buffalo_l` model):

      ```bash
      gcloud firestore indexes composite create \
        --collection-group=faces \
        --query-scope=COLLECTION \
        --field-config field-path=embedding,vector-config='{"dimension":"512", "flat": "{}"}' \
        --database="(default)" \
        --project=epstein-file-browser
      ```
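A sketch of building one record for the `faces` collection; the field names are assumptions, not the ingest script's actual schema. With the real client, the embedding list would be wrapped in Firestore's `Vector` type so the vector index can use it:

```python
def face_doc(doc_id: str, image_path: str, bbox: list, embedding: list) -> dict:
    """Assemble an illustrative Firestore record for one detected face.
    The index created above expects 512-dimensional vectors (buffalo_l)."""
    assert len(embedding) == 512, "buffalo_l embeddings are 512-dimensional"
    return {
        "id": doc_id,
        "image": image_path,
        "bbox": bbox,            # [x1, y1, x2, y2] in pixels
        "embedding": embedding,  # stored as a vector for Firestore Vector Search
    }
```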
The `epstein_files/` directory is organized by document ID. After running all steps, a typical directory looks like:

```
epstein_files/
├── 001/
│   ├── 001.pdf              # Original file
│   ├── content.txt          # Extracted text content
│   └── images/
│       ├── page1_img1.jpg   # Original extracted image
│       └── page1_img1/      # Analysis & formats directory
│           ├── analysis.json  # AI analysis (type, description, objects)
│           ├── meta.json      # EXIF/XMP metadata
│           ├── ocr.txt        # OCR text (if text was detected)
│           ├── full.avif      # Web-optimized full resolution
│           ├── medium.avif    # Medium-sized thumbnail
│           ├── small.avif     # Small thumbnail
│           ├── thumb.avif     # Thumbnail
│           └── tiny.avif      # Tiny placeholder
├── 002/
...
```
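Given that layout, checking which artifacts exist for one document can be sketched with `pathlib`; the helper name is illustrative:

```python
from pathlib import Path

def document_artifacts(doc_dir: Path) -> dict:
    """Summarize which artifacts exist for one document directory
    (epstein_files/<id>/) in the layout shown above."""
    images_dir = doc_dir / "images"
    return {
        "pdf": (doc_dir / f"{doc_dir.name}.pdf").exists(),
        "content": (doc_dir / "content.txt").exists(),
        "images": sorted(p.name for p in images_dir.glob("*.jpg"))
                  if images_dir.is_dir() else [],
    }
```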
The project includes a modern Next.js web application to browse and search the ingested documents.
- **Node.js:** Install Node.js (v18 or newer recommended).
- **Navigate to the site directory**

  ```bash
  cd site
  ```

- **Install Dependencies**

  ```bash
  npm install
  ```

- **Run Development Server**

  ```bash
  npm run dev
  ```

  The site will be available at `http://localhost:3000`.
- Document Browser: Filter by extracted entities, dates, or search text (using Firestore).
- Vector Search: (Planned) Search for faces or semantic concepts.
- Viewer: Markdown-rendered content and high-quality deep-zoom images.