A comprehensive Python tool for extracting text and images from various document formats while preserving structure. Optimized for Retrieval Augmented Generation (RAG) systems and knowledge base construction with advanced OCR post-processing for martial arts and multilingual content.
- Multi-format support: PDF, EPUB, DOCX, DJVU
- RAG-optimized output: Intelligent chunking and quality scoring for embedding systems
- Conservative OCR post-processing: Domain-aware text correction without aggressive spell checking
- Martial arts content detection: Specialized recognition of techniques, terminology, and multilingual content
- Structure preservation: Maintains headings, chapters, paragraphs, tables, and captions
- Image extraction: Extracts all images with metadata and proper naming
- Computer vision image detection: Advanced CV-based extraction of embedded images from scanned pages
- OCR capability: Full OCR support with confidence scoring and quality metrics
- Batch processing: Process single files or entire directories
- Structured output: JSON, human-readable text, and RAG-ready formats
- Comprehensive metadata: Author, title, page count, and format-specific information
# Install Python and pip
sudo apt update
sudo apt install python3 python3-pip
# Install DJVU tools (for DJVU support)
sudo apt install djvulibre-bin
# Install Tesseract OCR (for DJVU text extraction)
sudo apt install tesseract-ocr tesseract-ocr-eng
# Optional: Additional language packs
sudo apt install tesseract-ocr-chi-sim tesseract-ocr-jpn tesseract-ocr-kor

# Using Homebrew
brew install djvulibre tesseract
# Using MacPorts
sudo port install djvulibre tesseract

Install required Python packages:
# Using pip
pip install -r requirements.txt
# Or install individually
pip install PyPDF2 PyMuPDF pdfplumber ebooklib python-docx beautifulsoup4 pytesseract Pillow lxml opencv-python numpy

Or use the provided installation script:
chmod +x install_dependencies.sh
./install_dependencies.sh

# Extract a single document
python3 document_extractor.py "/path/to/document.pdf"

# Process every document in a directory
python3 document_extractor.py "/path/to/documents/directory"

# Specify a custom output directory
python3 document_extractor.py "/path/to/document.pdf" -o "/path/to/output"

# Enable verbose logging
python3 document_extractor.py "/path/to/document.pdf" -v

# Enable RAG-optimized output
python3 document_extractor.py "/path/to/document.pdf" --rag-mode

# Create an example configuration file
python3 document_extractor.py --create-config my_config.json
# Use configuration file
python3 document_extractor.py "/path/to/document.pdf" --config my_config.jsonpython3 document_extractor.py "/path/to/document.pdf" --no-embedded-images# Default: 400 DPI, English + Chinese (Traditional & Simplified), PSM 3
python3 document_extractor.py "/path/to/document.djvu"
# English only for faster processing
python3 document_extractor.py "/path/to/document.djvu" --ocr-lang eng
# Maximum quality OCR with RAG optimization
python3 document_extractor.py "/path/to/document.djvu" --ocr-dpi 600 --rag-mode# Default: 75000 area, 250x250 min size, 4:1 max aspect ratio
python3 image_extractor.py "extracted_documents/Document_Name/images"
# More sensitive detection (smaller images)
python3 image_extractor.py "extracted_documents/Document_Name/images" \
    --min-area 50000 --min-width 200 --min-height 200 -o "reextracted_images"

from pathlib import Path

from document_extractor import DocumentExtractor
# Initialize extractor with improved defaults
# (400 DPI, English + Chinese, less sensitive image extraction)
extractor = DocumentExtractor(output_dir="my_extracts")
# English only for faster processing
extractor = DocumentExtractor(output_dir="my_extracts", ocr_lang="eng")
# Maximum quality OCR
extractor = DocumentExtractor(
    output_dir="my_extracts",
    ocr_dpi=600,
    ocr_lang="eng+chi_sim+chi_tra"
)
# Extract a single document
doc_structure = extractor.extract_document("/path/to/document.pdf")
# Save results (standard format)
extractor.save_structure(doc_structure, Path("output/document_name"))
# Save with RAG optimization (recommended for knowledge bases)
extractor.save_structure_for_rag(doc_structure, Path("output/document_name"))

The document extractor uses a centralized configuration system that eliminates magic numbers and provides clear documentation for all parameters.
Generate an example configuration file with all available options:
python3 document_extractor.py --create-config my_config.json

This creates a JSON file with all configuration options and their default values:
{
  "_comment": "Document Extractor Configuration",
  "output_dir": "extracted_documents",
  "extract_embedded_images": true,
  "verbose_logging": false,
  "ocr": {
    "_comment": "OCR processing settings",
    "dpi": 400,
    "language": "eng+chi_sim+chi_tra",
    "page_segmentation_mode": 3,
    "low_confidence_threshold": 60
  },
  "image_detection": {
    "_comment": "Computer vision image detection settings",
    "min_area": 75000,
    "min_width": 250,
    "min_height": 250,
    "canny_low_threshold": 100,
    "canny_high_threshold": 200
  },
  "rag": {
    "_comment": "RAG system optimization settings",
    "target_chunk_size": 512,
    "min_quality_score": 0.6,
    "min_confidence_score": 0.7
  }
}

# Use configuration file
python3 document_extractor.py document.pdf --config my_config.json
# Override specific values via CLI
python3 document_extractor.py document.pdf --config my_config.json --ocr-dpi 600

- dpi: OCR processing resolution (72-1200)
- language: Tesseract language codes
- page_segmentation_mode: Text layout analysis mode (0-13)
- low_confidence_threshold: Words below this flagged for review (0-100)
- enable_preprocessing: Apply image enhancement before OCR
- min_area: Minimum pixel area for detected images
- min_width/min_height: Minimum dimensions in pixels
- canny_low/high_threshold: Edge detection sensitivity
- variance_threshold: Minimum pixel variance for images
- edge_density_min/max: Edge density range for image classification
- target_chunk_size: Target tokens per chunk (for embeddings)
- min_quality_score: Minimum quality for embedding chunks (0-1)
- min_confidence_score: Minimum OCR confidence for high-quality chunks (0-1)
- max_low_confidence_regions: Max uncertain words before flagging for review
- technique_relevance_weight: Weight for technique mentions in scoring
- chinese_relevance_weight: Weight for Chinese character ratio
- technique_patterns: Regex patterns for technique detection
- min_clean_char_ratio: Minimum ratio of valid characters (0-1)
- optimal_text_length: Text length that gets quality bonus
- max_char_repetition: Maximum allowed character repetition
from extractor_config import DocumentExtractorConfig
# Load from file
config = DocumentExtractorConfig.from_file("my_config.json")
# Create custom configuration
config = DocumentExtractorConfig()
config.ocr.dpi = 600
config.image_detection.min_area = 50000
config.rag.target_chunk_size = 256
# Use with extractor
extractor = DocumentExtractor(config=config)

The configuration system includes validation to catch common errors:
# If you set invalid values, you'll get helpful error messages
# Configuration validation errors:
# - OCR DPI should be between 72 and 1200
# - Minimum aspect ratio should be less than maximum aspect ratio
# - Quality score should be between 0 and 1

Each processed document creates a subdirectory with the following structure:
document_name/
├── structure.json # Complete structured data
├── content.txt # Human-readable text content
├── images/ # Extracted images directory
│ ├── page_001.png
│ ├── page_002.png
│ └── ...
└── images_manifest.json # Image metadata
document_name/
├── structure.json # Complete structured data
├── content.txt # Human-readable text content
├── rag_ready.json # RAG-optimized content with chunking
├── embeddings_chunks.jsonl # High-quality chunks for embedding
├── images/ # Extracted images directory
│ ├── page_001.png
│ ├── page_002.png
│ └── ...
└── images_manifest.json # Image metadata
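Each line of embeddings_chunks.jsonl is one self-contained chunk record, so it can be streamed straight into an embedding pipeline. A minimal reader sketch; the chunk_id and text field names follow the rag_chunks schema shown below, so treat them as assumptions about the JSONL layout:

import json

with open("document_name/embeddings_chunks.jsonl") as f:
    for line in f:
        chunk = json.loads(line)
        # 'text' is assumed to hold the chunk body to embed
        print(chunk["chunk_id"], len(chunk["text"]))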
The structure.json file contains:
{
  "title": "Document Title",
  "author": "Author Name",
  "format": "PDF",
  "pages": 150,
  "chapters": ["Chapter 1", "Chapter 2", ...],
  "content": [
    {
      "text": "Content text",
      "content_type": "heading|paragraph|table|caption",
      "level": 1,
      "page_number": 1,
      "chapter": "Chapter Name",
      "confidence_score": 0.95,
      "quality_score": 0.87,
      "technique_mentions": ["White Crane Spreads Wings", "Single Whip"],
      "low_confidence_regions": ["unclear_word1", "unclear_word2"],
      "metadata": {...}
    }
  ],
  "images": [
    {
      "filename": "page_001.png",
      "format": "PNG",
      "width": 800,
      "height": 600,
      "page_number": 1,
      "caption": "Image caption",
      "image_type": "illustration",
      "technique_demonstrations": ["Tiger Claw stance"],
      "relevance_score": 0.85,
      "metadata": {...}
    }
  ],
  "metadata": {...}
}

The rag_ready.json file contains optimized content for RAG systems:
{
  "document_metadata": {
    "title": "Document Title",
    "total_chunks": 45,
    "high_quality_chunks": 38,
    "total_techniques": 127,
    "relevant_images": 23
  },
  "rag_chunks": [
    {
      "chunk_id": "chunk_0001",
      "text": "Semantic chunk text optimized for embedding...",
      "context": "Chapter 3: Advanced Techniques",
      "technique_mentions": ["Iron Palm", "Golden Bell"],
      "confidence_score": 0.89,
      "quality_score": 0.92,
      "has_techniques": true,
      "chinese_ratio": 0.15,
      "page_numbers": [12, 13]
    }
  ],
  "processed_images": [
    {
      "filename": "page_012_img_001.png",
      "image_type": "illustration",
      "ocr_text": "Figure 3.1: Iron Palm Training",
      "technique_demonstrations": ["Iron Palm"],
      "relevance_score": 0.91
    }
  ],
  "technique_index": {
    "Iron Palm": [
      {"chunk_id": "chunk_0001", "context": "Training Methods"},
      {"chunk_id": "chunk_0023", "context": "Applications"}
    ]
  },
  "low_confidence_content": [
    {
      "chunk_id": "chunk_0015",
      "confidence_score": 0.63,
      "low_confidence_regions": ["unclear_term1", "unclear_term2"]
    }
  ]
}

The extractor includes domain-aware OCR post-processing designed for martial arts and multilingual content:
- Visual OCR errors: Fixes common misrecognitions like rn→m, cl→d, li→h (see the sketch after this list)
- Character repetition: Reduces excessive character duplication from OCR artifacts
- Whitespace normalization: Cleans up irregular spacing without affecting content
- Garbage character removal: Removes OCR artifacts while preserving Chinese characters
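These fixes are deliberately conservative: a blind rn→m rewrite would corrupt legitimate words such as "learn", so substitutions should only fire when the raw token is not already a known word. A minimal sketch of that idea, assuming hypothetical PROTECTED_TERMS and known_words collections (an illustration, not the extractor's actual implementation):

import re

# Hypothetical whitelist of domain terms that must never be "corrected"
PROTECTED_TERMS = {"Iron Palm", "Single Whip", "White Crane Spreads Wings"}

# Visual confusion pairs; applied only when the raw token is not a real word
VISUAL_PAIRS = [("rn", "m"), ("cl", "d"), ("li", "h")]

def conservative_fix(text, known_words):
    if any(term.lower() in text.lower() for term in PROTECTED_TERMS):
        return text  # preserve domain vocabulary verbatim
    fixed = []
    for word in text.split():
        candidate = word
        if word.lower() not in known_words:
            for bad, good in VISUAL_PAIRS:
                repaired = word.replace(bad, good)
                if repaired.lower() in known_words:
                    candidate = repaired
                    break
        fixed.append(candidate)
    # Collapse runs of 4+ identical characters left by OCR artifacts
    return re.sub(r"(.)\1{3,}", r"\1", " ".join(fixed))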
- Per-word confidence scoring: Uses Tesseract confidence data to identify uncertain regions
- Low-confidence flagging: Marks words with <60% confidence for manual review
- Quality metrics: Calculates overall text quality based on multiple factors
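The per-word confidence data comes straight from Tesseract. A minimal sketch using pytesseract's image_to_data with the 60% threshold described above (the extractor's internal quality metrics combine more factors than this):

import pytesseract
from PIL import Image

def flag_low_confidence(image_path, threshold=60):
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # -1 marks non-word boxes; skip those
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged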
- Martial arts terminology preservation: Avoids "correcting" valid technique names and terminology
- Multilingual content support: Handles mixed English/Chinese text appropriately
- Technical vocabulary protection: Preserves specialized terms, proper nouns, and transliterations
- Technique preservation: Keeps complete technique descriptions together
- Context-aware splitting: Maintains semantic coherence across chunk boundaries
- Natural breakpoints: Splits at headings, tables, and content type changes
- Size optimization: Targets 512-token chunks for optimal embedding performance
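A greedy version of this chunking strategy looks roughly like the sketch below; the blocks are assumed to follow the structure.json content schema, and the 4-characters-per-token estimate is a crude stand-in for the extractor's real sizing logic:

def chunk_blocks(blocks, target_tokens=512):
    # Greedy packing: start a new chunk at natural breakpoints
    # (headings, tables) or when the size target would be exceeded
    chunks, current, size = [], [], 0
    for block in blocks:
        block_tokens = len(block["text"]) // 4  # rough token estimate
        at_breakpoint = block["content_type"] in ("heading", "table")
        if current and (at_breakpoint or size + block_tokens > target_tokens):
            chunks.append(" ".join(b["text"] for b in current))
            current, size = [], 0
        current.append(block)
        size += block_tokens
    if current:
        chunks.append(" ".join(b["text"] for b in current))
    return chunks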
- High-quality chunks: Filters content with quality scores >0.6 for embedding
- Confidence thresholds: Excludes low-confidence OCR content from primary embeddings
- Content type weighting: Prioritizes headings and technique descriptions
- Review flagging: Identifies content needing manual verification
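These gates map directly onto fields in rag_ready.json. A one-function sketch using the default 0.6 quality and 0.7 confidence thresholds from the configuration section:

def embeddable_chunks(rag_chunks, min_quality=0.6, min_confidence=0.7):
    # Chunks failing either gate stay out of the primary embedding set
    # and land in the low_confidence_content review queue instead
    return [
        c for c in rag_chunks
        if c["quality_score"] >= min_quality
        and c["confidence_score"] >= min_confidence
    ]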
- Technique names: Detects patterns like "White Crane Spreads Wings", "Iron Palm"
- Terminology identification: Recognizes martial arts vocabulary and concepts
- Multilingual support: Handles Chinese terms, transliterations, and English descriptions
- Context analysis: Uses surrounding text to improve detection accuracy
- Cross-referencing: Maps techniques to multiple mentions across the document
- Context preservation: Associates techniques with their instructional context
- Hierarchical organization: Links techniques to chapters and sections
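A simplified detection-plus-indexing pass over RAG chunks might look like the following; the Title-Case regex is a hypothetical stand-in for the configurable technique_patterns:

import re
from collections import defaultdict

# Stand-in pattern: two to five capitalized words ("White Crane Spreads Wings")
TECHNIQUE_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+\s+){1,4}[A-Z][a-z]+\b")

def build_technique_index(rag_chunks):
    index = defaultdict(list)
    for chunk in rag_chunks:
        for match in TECHNIQUE_PATTERN.findall(chunk["text"]):
            index[match].append(
                {"chunk_id": chunk["chunk_id"], "context": chunk["context"]}
            )
    return dict(index)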
- Image classification: Categorizes images as photos, illustrations, diagrams, or icons
- OCR text extraction: Extracts text from technical diagrams and illustrations
- Technique demonstration detection: Identifies images showing martial arts techniques
- Relevance scoring: Ranks images by importance for martial arts knowledge
- Embedding preparation: Structured data ready for multimodal embedding models
- Caption enhancement: Combines original captions with OCR text and technique detection
- Cross-reference linking: Connects images to related text chunks
# Extract with full RAG optimization
python3 document_extractor.py "Kung_Fu_Manual.pdf" --rag-mode
# High-quality OCR with RAG chunking
python3 document_extractor.py "Martial_Arts_Book.djvu" --ocr-dpi 600 --rag-mode# Extract and process for knowledge base
extractor = DocumentExtractor(output_dir="knowledge_base")
doc_structure = extractor.extract_document("martial_arts_text.pdf")
output_dir = Path("knowledge_base/martial_arts_text")  # illustrative path
extractor.save_structure_for_rag(doc_structure, output_dir)
# Access RAG-ready data
import json
with open("output_dir/rag_ready.json") as f:
rag_data = json.load(f)
# High-quality chunks for embedding
chunks = rag_data['rag_chunks']
techniques = rag_data['technique_index']
relevant_images = [img for img in rag_data['processed_images']
                   if img['relevance_score'] > 0.7]

- Text extraction: Using PyMuPDF and PyPDF2
- Image extraction: All embedded images with metadata
- Computer vision image detection: Automatic detection and extraction of images from scanned pages
- Table extraction: Using pdfplumber for enhanced table detection
- Structure detection: Automatic heading and paragraph classification
- Font analysis: Content type classification based on font size and style
- Scanned page detection: Automatic OCR application for image-based pages
- Chapter extraction: Automatic chapter detection from navigation
- HTML processing: Clean text extraction from HTML content
- Image extraction: All embedded images (PNG, JPG, SVG)
- Metadata extraction: Title, author, publisher information
- Structure preservation: Headings, paragraphs, and content hierarchy
- Text extraction: Full document text with formatting
- Table extraction: Complete table structure preservation
- Image extraction: All embedded images and media
- Style analysis: Heading detection based on document styles
- Metadata extraction: Document properties and creation info
- OCR text extraction: Full text via Tesseract OCR
- Page conversion: Convert all pages to PNG images
- Computer vision image detection: Automatic extraction of images embedded within pages
- Batch processing: Efficient handling of multi-page documents
- Error handling: Graceful handling of OCR failures
The extractor classifies content into the following types:
- heading: Document headings (levels 1-6)
- paragraph: Regular text paragraphs
- table: Structured table data
- caption: Image and figure captions
- error: Error messages and warnings
The extractor now includes advanced computer vision capabilities to detect and extract images from scanned pages. This feature uses OpenCV to:
- Detect rectangular regions: Identify potential image boundaries using edge detection
- Filter text regions: Distinguish between images and text using statistical analysis
- Extract clean images: Save detected images as separate PNG files
- Preserve metadata: Include extraction method, bounding boxes, and quality metrics
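A stripped-down sketch of this pipeline using the default thresholds from the configuration section (Canny 100/200, 75000 px minimum area, 250 px minimum sides); the extractor's actual _detect_and_extract_images_from_page() adds statistical filtering on top of this:

import cv2

def detect_image_regions(page_png, min_area=75000, min_side=250):
    page = cv2.imread(page_png)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # default low/high thresholds
    contours, _ = cv2.findContours(
        edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # Size filters weed out text lines and small decorations
        if w * h >= min_area and w >= min_side and h >= min_side:
            regions.append(page[y:y + h, x:x + w])
    return regions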
For overly aggressive extraction, you can:

- Disable the feature entirely:

  python3 document_extractor.py document.djvu --no-embedded-images

- Adjust sensitivity in code by modifying these parameters in _detect_and_extract_images_from_page():
  - min_area: Minimum pixel area for detected images (default: 75000)
  - Canny edge detection: Edge detection thresholds (default: 100, 200)
  - Size filters: Minimum width/height (default: 250px)
- Customize detection logic in _is_likely_image_region():
  - variance: Pixel variance threshold for image detection
  - edge_density: Edge density ranges for filtering
  - entropy: Histogram entropy thresholds
The extractor includes several OCR enhancements for better text recognition:
- Gaussian blur: Reduces noise in scanned images
- Adaptive thresholding: Improves text contrast for varying lighting
- Morphological operations: Cleans up text and connects broken characters
- DPI scaling: Optimizes image resolution for OCR processing
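The preprocessing chain can be reproduced with OpenCV. A minimal sketch of the blur, threshold, and morphology stages, with illustrative parameter values rather than the extractor's exact ones:

import cv2

def preprocess_for_ocr(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Gaussian blur reduces scanner noise before binarization
    blurred = cv2.GaussianBlur(gray, (3, 3), 0)
    # Adaptive thresholding keeps text legible under uneven lighting
    binary = cv2.adaptiveThreshold(
        blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 15
    )
    # Morphological closing reconnects broken character strokes
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)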
- Language support: Multi-language OCR with language packs
- Page segmentation modes: Different strategies for text layout analysis
- OCR engine modes: Choose between legacy and neural network engines
- DPI optimization: Configurable resolution for quality vs speed trade-offs
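At the library level these options become a Tesseract config string passed through pytesseract; a small sketch with illustrative PSM/OEM/DPI values:

import pytesseract
from PIL import Image

# --psm 3: automatic page segmentation; --oem 1: LSTM neural engine
text = pytesseract.image_to_string(
    Image.open("page_001.png"),
    lang="eng+chi_sim+chi_tra",
    config="--psm 3 --oem 1 --dpi 400",
)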
# Default: High-quality OCR with 400 DPI, English + Chinese
python3 document_extractor.py document.djvu
# English only for faster processing
python3 document_extractor.py document.djvu --ocr-lang eng
# Maximum quality OCR
python3 document_extractor.py document.djvu --ocr-dpi 600
# Different page segmentation mode for complex layouts
python3 document_extractor.py document.djvu --ocr-psm 1
# Disable preprocessing if it's causing issues
python3 document_extractor.py document.djvu --no-ocr-preprocessing

Configure Tesseract for different languages:
# Install additional language packs
sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa
sudo apt install tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor
# Use multiple languages
python3 document_extractor.py document.djvu --ocr-lang fra+eng+deu

For situations where you want to re-process image extraction with different parameters:
# Re-extract images with improved defaults (less sensitive)
python3 image_extractor.py "extracted_documents/Document_Name/images"
# More sensitive detection (smaller images)
python3 image_extractor.py "extracted_documents/Document_Name/images" \
--min-area 50000 \
--min-width 200 \
--min-height 200 \
--max-aspect-ratio 5.0 \
-o "reextracted_images"The standalone tool allows you to:
- Adjust detection sensitivity without re-running OCR
- Fine-tune parameters for specific document types
- Experiment with settings to find optimal extraction
- Process existing extractions with new algorithms
Extend the save_structure method to create custom output formats:
def custom_save_format(doc_structure, output_dir):
    # Minimal example: dump every content block into one plain-text file
    # (assumes blocks expose a .text attribute, mirroring structure.json)
    with open(output_dir / "custom_output.txt", "w") as f:
        for block in doc_structure.content:
            f.write(block.text + "\n\n")

The extractor includes comprehensive error handling:
- Missing dependencies: Clear error messages with installation instructions
- Corrupted files: Graceful handling of damaged documents
- OCR failures: Fallback to image-only extraction
- Format detection: Automatic format detection with validation
- Large documents: Processing time scales with document size
- DJVU files: OCR processing is CPU-intensive
- Batch processing: Memory usage increases with concurrent extractions
- Image extraction: Disk space requirements for image-heavy documents
- "tesseract is not installed"

  sudo apt install tesseract-ocr

- "djvused: command not found"

  sudo apt install djvulibre-bin

- "Cannot import name 'CT_Table'"
  - Update python-docx:

    pip install --upgrade python-docx

- Memory errors with large files
  - Process files individually rather than in batches
  - Increase system memory or use swap space
Enable verbose logging for detailed processing information:
python3 document_extractor.py document.pdf -v

- PyPDF2: PDF text extraction
- PyMuPDF (fitz): Advanced PDF processing
- pdfplumber: PDF table extraction
- ebooklib: EPUB processing
- python-docx: DOCX document handling
- beautifulsoup4: HTML parsing
- pytesseract: OCR text extraction
- Pillow: Image processing
- lxml: XML processing
- opencv-python: Computer vision for image detection
- numpy: Numerical computing for image processing
- Python 3.8+
- Linux/macOS/Windows
- 2GB+ RAM (recommended for large documents)
- djvulibre-bin (for DJVU support)
- tesseract-ocr (for OCR functionality)
This project is open source. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit pull requests or open issues for bug reports and feature requests.
For issues and questions:
- Check the troubleshooting section
- Search existing issues
- Create a new issue with detailed information
python3 document_extractor.py "Shaolin Kung Fu Manual.pdf"python3 document_extractor.py "/home/user/papers" -o "/home/user/extracted_papers"extractor = DocumentExtractor(output_dir="custom_output")
doc = extractor.extract_document("document.epub")
print(f"Extracted {len(doc.content)} content items")
print(f"Found {len(doc.images)} images")- v1.0.0: Initial release with PDF, EPUB, DOCX, DJVU support
- v1.1.0: Added table extraction and improved structure detection
- v1.2.0: Enhanced OCR capabilities and error handling
- v1.3.0: Added computer vision-based image detection and extraction from scanned pages
- v2.0.0: RAG System Optimization Release
  - Conservative OCR post-processing for domain-specific content
  - Martial arts terminology detection and preservation
  - Intelligent chunking for embedding systems
  - Confidence scoring and quality assessment
  - RAG-ready output formats with technique indexing
  - Multimodal image processing with relevance scoring
  - Knowledge base construction optimization