A set of tools to scrape, inventory, and analyze files related to the Jeffrey Epstein case released by the Department of Justice.
The project includes a robust scraping script, `scrape_epstein.py`, designed to fetch all documents and media files from https://www.justice.gov/epstein.
- **Comprehensive Crawl:** Recursively finds files in subsections like Court Records and FOIA (FBI, BOP).
- **Bot Protection Bypass:** Uses `playwright-stealth` and user-like behavior to navigate Akamai protections.
- **Resumable:** Maintains a local `epstein_files/inventory.json` database. If the script is interrupted, simply run it again to pick up exactly where it left off.
- **Media Support:** Downloads PDFs and ZIPs, as well as media files like `.wav`, `.mp3`, and `.mp4`.
- **Collision Handling:** Automatically renames duplicate filenames (e.g. `file_1.pdf`) so no data is overwritten or lost.
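The collision-handling rule above can be sketched as a pure function. `unique_name` is an illustrative helper, not necessarily how `scrape_epstein.py` implements it:

```python
def unique_name(existing: set, filename: str) -> str:
    """Pick a non-colliding filename: file.pdf -> file_1.pdf -> file_2.pdf ...

    `existing` is the set of filenames already present in the target directory.
    """
    if filename not in existing:
        return filename
    stem, dot, suffix = filename.rpartition(".")
    n = 1
    while True:
        # Insert a numeric suffix before the extension (or append if none).
        candidate = f"{stem}_{n}.{suffix}" if dot else f"{filename}_{n}"
        if candidate not in existing:
            return candidate
        n += 1
```

Because the counter only advances until the first free name, re-running the scraper never overwrites a previously downloaded file.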
- **Install Dependencies**

  ```bash
  pip install playwright playwright-stealth pymupdf
  playwright install chromium
  ```
- **Run Scraper**

  ```bash
  python scrape_epstein.py
  ```
  The script will:
  - Create an `epstein_files/` directory.
  - Crawl the Justice.gov pages.
  - Populate `epstein_files/inventory.json`.
  - Download all new files.
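The resume behavior can be sketched as follows. The `inventory.json` schema shown here (URL keys with a `downloaded` flag) is an assumption for illustration, not the script's actual format:

```python
import json
from pathlib import Path

def load_inventory(path: Path) -> dict:
    """Load inventory.json if present, else start a fresh inventory.
    Assumed (illustrative) schema: {url: {"filename": str, "downloaded": bool}}."""
    if path.exists():
        return json.loads(path.read_text())
    return {}

def pending_urls(inventory: dict) -> list:
    """URLs recorded in the inventory but not yet downloaded; on a re-run,
    only these need fetching, which is what makes the script resumable."""
    return [url for url, entry in inventory.items() if not entry.get("downloaded")]
```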
- **Classify Files (Optional but Recommended)**

  ```bash
  python classify_files.py
  ```

  This script analyzes downloaded PDFs to determine whether they are Text (searchable) or Scanned (images). It updates `epstein_files/inventory.json` with this classification, enabling targeted OCR processing.
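A minimal version of the text-vs-scanned heuristic, classifying from per-page extracted text (with PyMuPDF that would be roughly `[page.get_text() for page in fitz.open(path)]`). The 25-character threshold is an assumption, not necessarily what `classify_files.py` uses:

```python
def classify_pdf(page_texts: list, min_chars_per_page: int = 25) -> str:
    """Classify a PDF as 'text' or 'scanned' from its per-page extracted text.

    A scanned PDF yields little or no embedded text, so a low average
    character count per page suggests the pages are images needing OCR.
    """
    if not page_texts:
        return "scanned"
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return "text" if avg >= min_chars_per_page else "scanned"
```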
- **Extract Content**

  ```bash
  python extract_content.py
  ```

  Extracts embedded images and text from the PDFs into dedicated subdirectories (e.g., `epstein_files/001/images/`).
- **Process Images**

  ```bash
  python process_images.py [--overwrite] [--just documents|extracted]
  ```

  Generates web-optimized AVIF derivatives for all images and PDFs found in the inventory.
  - **Documents (PDFs):** Generates a lightweight preview (`medium.avif` at 800px, page 1 only) and an `info.json` with metadata.
  - **Extracted Images:** Generates sized derivatives (tiny, thumb, small, medium, full).
  - **Flags:**
    - `--overwrite`: Force regeneration of existing files (useful for applying new quality settings).
    - `--just`: Limit scope to `documents` (PDFs only) or `extracted` (images only).
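The derivative sizing can be sketched like this. Only the 800px `medium` width appears above; the other pixel widths are placeholder assumptions:

```python
# Illustrative derivative widths; only medium=800 is documented above.
DERIVATIVE_WIDTHS = {"tiny": 64, "thumb": 160, "small": 400, "medium": 800, "full": None}

def target_size(orig_w: int, orig_h: int, width) -> tuple:
    """Scale to the derivative width, preserving aspect ratio; never upscale.
    A `width` of None means keep the original resolution (the 'full' tier)."""
    if width is None or orig_w <= width:
        return orig_w, orig_h
    return width, round(orig_h * width / orig_w)
```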
- **Extract Metadata**

  ```bash
  python extract_metadata.py
  ```

  Extracts embedded EXIF and XMP metadata from all images and PDFs in the inventory.
  - **Output:** Creates a `meta.json` file in the image's or document's directory containing the raw metadata.
  - **PDF Support:** Extracts XMP, standard document info, layers (OCGs), fonts, embedded files, and annotation summaries.
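A sketch of assembling the `meta.json` payload; the field layout here is an assumption for illustration, not the script's actual schema:

```python
import json

def build_meta(exif: dict, xmp: dict) -> str:
    """Serialize raw metadata sections into a meta.json payload.
    `default=str` guards against non-JSON-serializable EXIF values
    (e.g. rational numbers or byte strings from camera tags)."""
    payload = {"exif": exif, "xmp": xmp, "has_metadata": bool(exif or xmp)}
    return json.dumps(payload, indent=2, default=str)
```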
- **Image Analysis**

  ```bash
  python analyze_images.py [--overwrite]
  ```

  Uses a local LLM to analyze extracted images and generate structured JSON descriptions (`type`, `objects`, `ocr_needed`, etc.).

  Requirements:
  - A vision-capable model loaded (e.g., `mistralai/ministral-3-3b` or `llava`).
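Local models sometimes emit malformed JSON, so it is worth validating a response before writing `analysis.json`. A sketch, using the field names listed above (the exact required-key set is an assumption):

```python
import json

REQUIRED_KEYS = {"type", "objects", "ocr_needed"}  # subset of the fields named above

def parse_analysis(raw: str):
    """Parse a model response into a structured analysis record.
    Returns None if the text is not valid JSON, is not an object,
    or is missing any required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    return data
```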
- **Perform OCR**

  ```bash
  python perform_ocr.py [--dry-run]
  ```

  Walks through the `epstein_files` directory and performs OCR on images flagged with `"needs_ocr": true` in their `analysis.json` file.

  Features:
  - **Smart Selection:** Prioritizes original high-quality images (`.png`/`.jpg`) over compressed `.avif` if available.
  - **Auto-Resize:** Automatically resizes images larger than 2048px to prevent API errors.
  - **Resumable:** Skips directories where `ocr.txt` already exists.
  - **Dry Run:** Use `--dry-run` to see which files would be processed without making API calls.

  Requirements:
  - LM Studio running on `http://localhost:1234` (or a configured URL).
  - An OCR-capable model loaded (recommended: `allenai/olmocr-2-7b`).
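The selection and resize rules can be sketched as pure functions; the helper names are illustrative, not taken from `perform_ocr.py`:

```python
def pick_ocr_source(filenames: list):
    """Choose the best OCR source among one image directory's files:
    skip the directory entirely if ocr.txt exists (resumable); otherwise
    prefer original .png/.jpg over compressed .avif."""
    if "ocr.txt" in filenames:
        return None
    for ext in (".png", ".jpg", ".jpeg", ".avif"):
        matches = sorted(f for f in filenames if f.lower().endswith(ext))
        if matches:
            return matches[0]
    return None

def clamp_size(w: int, h: int, limit: int = 2048) -> tuple:
    """Auto-resize: scale down so the longest side is at most `limit` px."""
    longest = max(w, h)
    if longest <= limit:
        return w, h
    scale = limit / longest
    return round(w * scale), round(h * scale)
```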
- **Perform PDF OCR**

  ```bash
  python perform_pdf_ocr.py [--dry-run] [--overwrite]
  ```

  Performs page-by-page OCR on the full PDF documents using LM Studio. This is useful for documents that are scanned images without embedded text.
  - **Features:**
    - Renders each page to a high-quality PNG (1288px max dimension).
    - Sends each page plus an expert prompt to LM Studio.
    - Aggregates pages into a single `ocr.md` markdown file.
  - **Requirements:** Same as Image OCR (LM Studio + vision model).
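Two pieces of that pipeline sketched in isolation: computing the render zoom for the 1288px cap (PDF pages are sized in points, rendered at 72 dpi per unit zoom), and joining per-page results into `ocr.md`. The page-header format is an assumption:

```python
def page_zoom(width_pt: float, height_pt: float, max_px: int = 1288) -> float:
    """Zoom factor so the page's longest side renders at max_px pixels,
    e.g. for PyMuPDF's page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))."""
    return max_px / max(width_pt, height_pt)

def aggregate_pages(page_texts: list) -> str:
    """Join per-page OCR results into a single markdown document."""
    parts = [f"## Page {i}\n\n{text.strip()}" for i, text in enumerate(page_texts, 1)]
    return "\n\n".join(parts) + "\n"
```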
- **Transcribe Media**

  ```bash
  python transcribe_media.py [--model large-v2] [--device cpu|cuda]
  ```

  Transcribes audio/video files (mp3, wav, mp4, etc.) found in the inventory using WhisperX. It generates a `.vtt` subtitle file next to each media file.

  Requirements:
  - FFmpeg must be installed and on your system PATH.
  - WhisperX: `pip install git+https://github.com/m-bain/whisperX.git`
  - HuggingFace Token (optional): Set `HF_TOKEN` in `.env` for speaker diarization (requires accepting the pyannote terms).
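The `.vtt` output format is simple enough to sketch: WebVTT cues pair `HH:MM:SS.mmm` timestamps with text. These helpers are illustrative, not from `transcribe_media.py`:

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time offset as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def vtt_cue(start: float, end: float, text: str) -> str:
    """One WebVTT cue block: a timing line followed by the spoken text."""
    return f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text}\n"
```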
- **Detect Faces**

  ```bash
  python detect_faces.py [--overwrite]
  ```

  Scans all images in the inventory for faces using `insightface`.
  - **Features:**
    - Detects bounding boxes and landmarks, and extracts embeddings for facial recognition/clustering.
    - Saves results to `faces.json` in the image's directory.
    - Ignores the `has_faces` flag from analysis (processes everything) for maximum coverage.
  - **Requirements:** `insightface` and `onnxruntime` installed (included in `requirements.txt`).
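Clustering or matching those embeddings typically comes down to cosine similarity between vectors; a dependency-free sketch:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two face embeddings (e.g. the vectors
    insightface produces); values near 1.0 suggest the same person,
    values near 0.0 suggest unrelated faces."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice the match threshold is model-dependent and tuned empirically.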
- **Ingest to Firebase**

  ```bash
  python ingest_to_firebase.py [--only documents|images|faces] [--force]
  ```

  Populates a Firestore database with the processed data.
  - **Documents:** Uploads PDF previews and metadata to the `documents` collection.
  - **Images:** Uploads extracted photo previews and metadata to the `images` collection.
  - **Faces:** NEW! Ingests detected faces and vector embeddings to the `faces` collection.
    - **Vector Search:** Uses Firestore Vector Search. You must create the index first (note: the dimension is 512 for the default `buffalo_l` model):

      ```bash
      gcloud firestore indexes composite create \
        --collection-group=faces \
        --query-scope=COLLECTION \
        --field-config field-path=embedding,vector-config='{"dimension":"512", "flat": "{}"}' \
        --database="(default)" \
        --project=epstein-file-browser
      ```
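A sketch of building one record for the `faces` collection; the field names are assumptions, not the ingest script's actual schema. With the real client, the embedding list would be wrapped in Firestore's `Vector` type so the vector index can use it:

```python
def face_doc(doc_id: str, image_path: str, bbox: list, embedding: list) -> dict:
    """Assemble an illustrative Firestore record for one detected face.
    The index created above expects 512-dimensional vectors (buffalo_l)."""
    assert len(embedding) == 512, "buffalo_l embeddings are 512-dimensional"
    return {
        "id": doc_id,
        "image": image_path,
        "bbox": bbox,            # [x1, y1, x2, y2] in pixels
        "embedding": embedding,  # stored as a vector for Firestore Vector Search
    }
```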
The `epstein_files/` directory is organized by document ID. After running all steps, a typical directory looks like:

```
epstein_files/
├── 001/
│   ├── 001.pdf              # Original file
│   ├── content.txt          # Extracted text content
│   └── images/
│       ├── page1_img1.jpg   # Original extracted image
│       └── page1_img1/      # Analysis & formats directory
│           ├── analysis.json  # AI analysis (type, description, objects)
│           ├── meta.json      # EXIF/XMP metadata
│           ├── ocr.txt        # OCR text (if text was detected)
│           ├── full.avif      # Web-optimized full resolution
│           ├── medium.avif    # Medium-sized thumbnail
│           ├── small.avif     # Small thumbnail
│           ├── thumb.avif     # Thumbnail
│           └── tiny.avif      # Tiny placeholder
├── 002/
...
```
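Given that layout, checking which artifacts exist for one document can be sketched with `pathlib`; the helper name is illustrative:

```python
from pathlib import Path

def document_artifacts(doc_dir: Path) -> dict:
    """Summarize which artifacts exist for one document directory
    (epstein_files/<id>/) in the layout shown above."""
    images_dir = doc_dir / "images"
    return {
        "pdf": (doc_dir / f"{doc_dir.name}.pdf").exists(),
        "content": (doc_dir / "content.txt").exists(),
        "images": sorted(p.name for p in images_dir.glob("*.jpg"))
                  if images_dir.is_dir() else [],
    }
```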
The project includes a modern Next.js web application to browse and search the ingested documents.
- **Node.js:** Install Node.js (v18 or newer recommended).
- **Navigate to the site directory**

  ```bash
  cd site
  ```

- **Install Dependencies**

  ```bash
  npm install
  ```

- **Run Development Server**

  ```bash
  npm run dev
  ```

  The site will be available at `http://localhost:3000`.
- Document Browser: Filter by extracted entities, dates, or search text (using Firestore).
- Vector Search: (Planned) Search for faces or semantic concepts.
- Viewer: Markdown-rendered content and high-quality deep-zoom images.