An automated system that converts Microsoft Word documents into an interactive web-based reader backed by JSON content. Originally created for the Animal Health Handbook, it can be adapted to any large document with a chapter/section structure.
The chapter-viewer directory is a fully self-contained React application that can be used independently:
- Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
- Reusable for any book - Just provide your own JSON content
- No dependencies on parent project - All book data stored within the viewer directory
- Ready to deploy - Complete standalone web application
- Easy to customize - Modern React codebase with clear structure
# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer
# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)
# Install and run
pnpm install
pnpm dev
The viewer becomes a universal book reader, suited to documentation, handbooks, manuals, or any structured content.
See chapter-viewer/README.md for detailed standalone usage instructions.
After building your book, you can distribute the complete viewer:
# Build your book
make build
# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/
# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer
# Recipients can use it immediately:
cd my-book-viewer
pnpm install
pnpm dev
The chapter-viewer directory contains:
- All book content in book_content_json/
- All images in book_content_json/chapter_XX/pictures/
- Complete React application
- Ready to run with no external dependencies
This makes it perfect for:
- Distributing documentation as a web app
- Hosting on GitHub Pages, Netlify, Vercel
- Sharing as an offline viewer
- Creating multiple book viewers from one codebase

- One-command build - Single make build converts entire document
- Smart chapter detection - Automatically identifies chapters and sections
- Image extraction - Extracts all images including WMF conversion
- Table processing - Preserves complex table structures
- Format preservation - Maintains bold, italic, fonts, alignment
- 47% size optimization - Intelligent removal of redundant data
- TOC validation - Cross-references table of contents with actual content
- Verification tools - Built-in integrity checking
- React web viewer - Responsive mobile-friendly interface
# 1. Install dependencies
make install-deps
# 2. Build the book
make build
# 3. Start the web viewer
make viewer
Open your browser to http://localhost:3000 to view the book.
- Python 3.8+ with python-docx
- ImageMagick 7+ - Image processing
- Ghostscript - PDF to PNG conversion
- LibreOffice - WMF to PDF conversion
- Node.js 16+ - Web viewer
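The requirements above can be checked from Python before a build. The following is a minimal sketch, not the logic behind make check-deps: the function name `find_missing_tools` and the executable names in `REQUIRED_TOOLS` are assumptions, and the names on your PATH may differ by platform. It only checks that a tool is present, not its version.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Candidate executable names per tool. These names are assumptions:
# ImageMagick 7 usually ships `magick`, LibreOffice is often `soffice`.
REQUIRED_TOOLS: Dict[str, List[str]] = {
    "Python": ["python3", "python"],
    "ImageMagick": ["magick"],
    "Ghostscript": ["gs"],
    "LibreOffice": ["libreoffice", "soffice"],
    "Node.js": ["node"],
}


def find_missing_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools for which no candidate executable is on PATH."""
    missing = []
    for tool, candidates in REQUIRED_TOOLS.items():
        if not any(which(name) for name in candidates):
            missing.append(tool)
    return missing
```

Calling `find_missing_tools()` with no arguments uses the real PATH; the `which` parameter exists only so the lookup can be replaced in tests.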
macOS:
brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps
Linux (Ubuntu/Debian):
sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps
macOS LibreOffice Setup:
If LibreOffice was installed via DMG (not Homebrew), run:
make setup-libreoffice
This creates a symlink so ImageMagick can access LibreOffice.
Word Document
↓
1. Extract chapters & sections
2. Parse text with formatting
3. Extract images (WMF → PNG)
4. Process tables
5. Optimize JSON (47% reduction)
6. Build navigation index
↓
Interactive Web Viewer
- Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Health")
- Section Splitting - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
- TOC Extraction - Extracts and validates Table of Contents
- Content Parsing - Preserves formatting, images, tables, footnotes
- WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
- JSON Optimization - Removes empty arrays, objects, default values
- Index Building - Creates navigation structure with statistics
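The section-splitting step can be illustrated with a short sketch. It assumes headings are plain paragraph texts that begin with an N.X number; the regular expression and the function name `split_into_sections` are hypothetical, and build_book.py may rely on additional signals such as paragraph styles.

```python
import re
from typing import Dict, List

# "N.X Title" headings, e.g. "1.0 Health" or "1.2 Nutrition".
# The pattern is an assumption made for this sketch.
HEADING_RE = re.compile(r"^(\d+)\.(\d+)\s+\S")


def split_into_sections(paragraphs: List[str]) -> Dict[str, List[str]]:
    """Group paragraph texts under the most recent N.X heading.

    Text that appears before the first heading is collected
    under the "front_matter" key.
    """
    sections: Dict[str, List[str]] = {"front_matter": []}
    current = "front_matter"
    for text in paragraphs:
        match = HEADING_RE.match(text.strip())
        if match:
            current = f"{match.group(1)}.{match.group(2)}"
            sections.setdefault(current, [])
        sections[current].append(text)
    return sections
```

A purely textual pattern like this would also match body text such as "3.5 mg per dose", which is one reason a real detector probably needs more context than the text alone.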
make build          # Build complete book content
make rebuild-all    # Clean and rebuild from scratch
make clean          # Remove generated files

make dev            # Build and start viewer in one command
make viewer         # Start chapter-viewer dev server
make status         # Show current project status
make stats          # Display content statistics

make check-deps     # Verify all dependencies installed
make verify         # Check image integrity and content

project-root/
├── build_book.py                      # Main build system (JSON output)
├── split_chapters.py                  # Reference: Split to DOCX chapters
├── split_to_md_chapters.py            # Reference: Split to Markdown chapters
├── verify_images.py                   # Image verification tool
├── Makefile                           # Build automation
├── setup_libreoffice.sh               # LibreOffice configuration helper
├── requirements.txt                   # Python dependencies
├── LICENSE                            # GPL-3.0 license
│
├── English HAH Word Apr 6 2024.docx   # Source document (not in repo)
│
├── markdown_chapters/                 # Markdown export (optional, not in repo)
│   ├── README.md                      # Navigation index
│   └── chapter_XX/                    # Chapter directories
│       ├── section_X_X.md             # Section content
│       └── pictures/                  # Extracted images
│
└── chapter-viewer/                    # STANDALONE React web application
    ├── book_content_json/             # Book data (self-contained!)
    │   ├── index.json                 # Navigation index
    │   ├── toc_structure.json         # Table of contents
    │   └── chapter_XX/                # Chapter directories
    │       ├── chapter.json           # Chapter metadata
    │       ├── section_XX.json        # Section content
    │       └── pictures/              # Chapter images
    ├── src/                           # React source code
    ├── public/
    │   └── book_content_json/         # Symlink to ../book_content_json/
    ├── package.json
    └── README.md                      # Standalone usage guide
Each section file contains:
{
  "chapter_number": 1,
  "chapter_title": "1.0 HEALTH & DISEASE",
  "content": [
    {
      "type": "paragraph",
      "index": 0,
      "text": "Full paragraph text",
      "runs": [
        {"text": "Bold text", "bold": true, "font_size": 12.0}
      ],
      "alignment": "LEFT (0)"
    },
    {
      "type": "table",
      "rows": 3,
      "cols": 2,
      "cells": [...]
    }
  ],
  "statistics": {
    "paragraphs": 78,
    "tables": 1,
    "images": 10
  }
}
Chapter detection handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):
- Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
- Appendix chapters: Start with N.1 section (e.g., "24.1 Infectious Diseases")
- 29 total chapters fully detected and processed
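The two chapter styles above can be told apart with one pass over the headings. This is a sketch under assumptions: it treats the first N.0 or N.1 heading seen for each N as the chapter opener, and the name `find_chapter_starts` is hypothetical rather than the project's is_chapter_heading().

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r"^(\d+)\.(\d+)\s+\S")


def find_chapter_starts(headings: List[str]) -> List[Tuple[int, str, str]]:
    """Return (chapter_number, kind, heading) for each chapter's opening heading.

    A heading "N.0 ..." opens a regular chapter. If chapter N has not been
    opened yet and "N.1 ..." appears, it opens an appendix-style chapter.
    """
    starts: List[Tuple[int, str, str]] = []
    seen = set()
    for heading in headings:
        text = heading.strip()
        match = HEADING_RE.match(text)
        if not match:
            continue
        chapter, section = int(match.group(1)), int(match.group(2))
        # Only N.0 and N.1 may open a chapter, and only once per chapter.
        if chapter in seen or section not in (0, 1):
            continue
        seen.add(chapter)
        kind = "regular" if section == 0 else "appendix"
        starts.append((chapter, kind, text))
    return starts
```

Restricting openers to N.0 and N.1 keeps a stray later heading such as "7.4 ..." from creating a chapter on its own.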
- Extracts entire Table of Contents (433 entries)
- Excludes TOC paragraphs from actual content
- Cross-validates TOC against actual content
- Generates detailed discrepancy report
- Uses actual content titles as source of truth
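Cross-validation of this kind amounts to comparing two mappings from section number to title. The sketch below assumes both sides are available as dictionaries; the function names and the report keys are hypothetical and are not the format of toc_validation_report.json.

```python
import re
from typing import Dict, List


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", title).strip().lower()


def validate_toc(toc: Dict[str, str], content: Dict[str, str]) -> Dict[str, List]:
    """Compare TOC titles against content titles, keyed by section number.

    Content titles are treated as the source of truth, so a mismatch
    records both titles instead of overwriting the content title.
    """
    report: Dict[str, List] = {
        "missing_in_content": [],
        "missing_in_toc": [],
        "title_mismatch": [],
    }
    for number, toc_title in toc.items():
        if number not in content:
            report["missing_in_content"].append(number)
        elif normalize(toc_title) != normalize(content[number]):
            report["title_mismatch"].append(
                {"section": number, "toc": toc_title, "content": content[number]}
            )
    report["missing_in_toc"] = [n for n in content if n not in toc]
    return report
```

Normalizing case and whitespace before comparing avoids reporting entries that differ only in capitalization, such as "Health & Disease" versus "HEALTH & DISEASE".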
Achieves 47% file size reduction by removing:
- Empty arrays: "images": [], "footnotes": []
- Empty objects: "formatting": {}
- Empty text runs
- Common defaults: "bold": false, "italic": false
Result: 12.8 MB → 6.8 MB (6 MB savings)
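The removal rules above can be expressed as one recursive pass over the parsed JSON. This is a minimal sketch, not the code in build_book.py: the function name `prune` and the `DEFAULTS` table are assumptions, and only the two defaults named above are handled. The reported saving follows from the sizes: (12.8 - 6.8) / 12.8 ≈ 0.47.

```python
from typing import Any

# Default values that are safe to drop. Which keys count as defaults
# is an assumption for this sketch.
DEFAULTS = {"bold": False, "italic": False}


def prune(value: Any) -> Any:
    """Recursively drop empty lists, empty dicts, empty runs, and defaults."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = prune(item)
            if key == "runs" and isinstance(item, list):
                # Drop runs whose text is explicitly empty.
                item = [
                    run for run in item
                    if not (isinstance(run, dict) and run.get("text") == "")
                ]
            if item == [] or item == {}:
                continue
            # Identity check so that 0 is never mistaken for False.
            if key in DEFAULTS and item is DEFAULTS[key]:
                continue
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        pruned = [prune(item) for item in value]
        return [item for item in pruned if item != {} and item != []]
    return value
```

A viewer reading the optimized files has to treat a missing key as its default, for example reading a run's bold flag with a fallback of false.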
Automatically converts Windows Metafile images using the conversion chain:
WMF → LibreOffice → PDF → Ghostscript → PNG
Handles 35 WMF images (~3% of 1,066 total images).
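The chain can be sketched as two command-line calls driven from Python. This is an illustration under assumptions: it calls LibreOffice and Ghostscript directly, whereas the setup notes suggest the build may route the conversion through ImageMagick; the executable names `soffice` and `gs`, the 150 DPI default, and both function names are hypothetical choices.

```python
import subprocess
from pathlib import Path
from typing import List


def wmf_to_png_commands(wmf: Path, out_dir: Path, dpi: int = 150) -> List[List[str]]:
    """Build the commands for WMF -> PDF (LibreOffice) -> PNG (Ghostscript)."""
    pdf = out_dir / (wmf.stem + ".pdf")
    png = out_dir / (wmf.stem + ".png")
    to_pdf = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(wmf),
    ]
    to_png = [
        "gs", "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={png}", str(pdf),
    ]
    return [to_pdf, to_png]


def convert_wmf(wmf: Path, out_dir: Path) -> Path:
    """Run both steps; raises CalledProcessError if either tool fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in wmf_to_png_commands(wmf, out_dir):
        subprocess.run(command, check=True)
    return out_dir / (wmf.stem + ".png")
```

Building the commands separately from running them makes the chain easy to inspect or log before anything is executed.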
Edit build_book.py to customize:
INPUT_DOCX = "Your-Document.docx"
JSON_DIR = "book_content_json"
ENABLE_OPTIMIZATION = True # JSON optimization
ENABLE_TOC_VALIDATION = True # TOC validation
Typical results for Animal Health Handbook:
| Metric | Value |
|---|---|
| Chapters | 29 (100% detected) |
| Sections | 416 |
| Paragraphs | 12,389 |
| Tables | 70 |
| Images | 1,066 (35 WMF converted) |
| JSON Size | 6.8 MB (47% optimized) |
| Build Time | ~90-100 seconds |
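Totals like the ones in this table could be recomputed from the generated files by summing the statistics block of each section. The sketch below assumes the chapter_XX/section_XX.json layout and the statistics keys shown earlier; the function name `aggregate_statistics` is hypothetical, and make stats may compute its figures differently.

```python
import json
from collections import Counter
from pathlib import Path


def aggregate_statistics(json_dir: Path) -> Counter:
    """Sum the "statistics" block of every section file under json_dir."""
    totals: Counter = Counter()
    for path in sorted(json_dir.glob("chapter_*/section_*.json")):
        with path.open(encoding="utf-8") as handle:
            section = json.load(handle)
        # Optimized files may omit the block or individual keys.
        totals.update(section.get("statistics", {}))
        totals["sections"] += 1
    return totals
```

For example, `aggregate_statistics(Path("chapter-viewer/book_content_json"))` returns a Counter whose "paragraphs", "tables", "images", and "sections" entries can be compared against a previous build.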
# Check if LibreOffice is accessible
libreoffice --version
# If not found, configure it
make setup-libreoffice
# Rebuild
make rebuild-all

# Check image integrity
make verify
# If issues found, rebuild
make rebuild-all

# Check what's missing
make check-deps
# Install dependencies
make install-deps

# Clean and rebuild
make clean
make build
# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

The project includes reference scripts that demonstrate the conversion logic:
split_chapters.py is a legacy script that splits the book into separate DOCX files per chapter, preserving all formatting, images, and footnotes. Useful for:
- Creating individual chapter files
- Manual editing of specific chapters
- Understanding the document structure
python3 split_chapters.py
# Output: chapters/chapter_01.docx, chapter_02.docx, etc.
split_to_md_chapters.py is a reference script that converts chapters to Markdown format, demonstrating the same logic as build_book.py but generating readable .md files instead of JSON. Useful for:
- Viewing content in any Markdown viewer
- Understanding the conversion logic
- Creating documentation or exports
- Comparing with chapter-viewer output
python3 split_to_md_chapters.py
# Output: markdown_chapters/chapter_01/*.md with images
Features:
- Preserves text formatting (bold, italic, underline)
- Converts tables to Markdown table format
- Extracts and references images
- Maintains chapter/section structure
- Creates navigation indexes
- Output viewable in any Markdown viewer
The Markdown output closely matches what you see in the chapter-viewer, making it perfect for:
- Verifying conversion accuracy
- Learning how the system processes documents
- Creating alternative export formats
- Documentation and archival purposes
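To show how JSON-style content can map onto Markdown, here is a small sketch that renders formatting runs and a table. It is not taken from split_to_md_chapters.py: both function names are hypothetical, the table is assumed to be a list of rows of strings, and underline is left out because Markdown has no native syntax for it.

```python
from typing import Dict, List


def runs_to_markdown(runs: List[Dict]) -> str:
    """Render formatting runs as inline Markdown (bold, italic, or both)."""
    parts = []
    for run in runs:
        text = run.get("text", "")
        stripped = text.strip()
        if not stripped:
            parts.append(text)  # keep whitespace-only runs unmarked
            continue
        marker = "**" * bool(run.get("bold")) + "*" * bool(run.get("italic"))
        # Keep surrounding spaces outside the markers so emphasis stays valid.
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        parts.append(f"{lead}{marker}{stripped}{marker}{trail}")
    return "".join(parts)


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render a rectangular table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Missing "bold" or "italic" keys are read as false, which matches files where default values have been removed.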
To process your own Word document:
- Place your .docx file in the project root
- Update INPUT_DOCX in build_book.py (or reference scripts)
- Adjust chapter detection patterns if needed (see is_chapter_heading())
- Run make rebuild-all
For debugging or compatibility:
# In build_book.py
ENABLE_OPTIMIZATION = False
ENABLE_TOC_VALIDATION = False
After build, check:
- chapter-viewer/book_content_json/toc_validation_report.json - TOC discrepancies
- chapter-viewer/book_content_json/toc_structure.json - Extracted TOC
- WMF_CONVERSION_GUIDE.md - Image conversion guide
- MARKDOWN_EXPORT_GUIDE.md - Markdown export reference guide
- chapter-viewer/README.md - Web viewer documentation
- CONTRIBUTING.md - Contribution guidelines
- split_chapters.py - Split book into DOCX chapter files
- split_to_md_chapters.py - Convert chapters to Markdown format
- build_book.py - Main build system (JSON output)
- verify_images.py - Image verification tool
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Run make verify to check integrity
- Submit a pull request
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means you can:
- Use commercially
- Modify the code
- Distribute
- Use privately
Under the conditions:
- Disclose source
- License and copyright notice
- Same license for derivatives
- State changes made
See LICENSE file for full details.
- Original development for Animal Health Handbook document conversion
- Authors: Dr. Peter Quesenberry and Dr. Maureen Birmingham (original handbook)
- python-docx - Word document parsing
- ImageMagick - Image processing
- LibreOffice - Document conversion
- React - Web viewer interface
- Vite - Build tooling
For issues, questions, or suggestions:
- Check the troubleshooting section above
- Review existing issues on GitHub
- Create a new issue with:
  - System information (OS, Python version, etc.)
  - Output of make check-deps
  - Error messages or unexpected behavior
  - Steps to reproduce
Potential future enhancements:
- Support for more document formats (PDF, EPUB input)
- Full-text search in viewer
- Export to EPUB/PDF from JSON
- More aggressive image optimization
- Multi-language support
- Cloud deployment guides
- Docker containerization
Note: This repository does not include the source Word document or generated content. You'll need to provide your own document to process.