Document Conversion System

An automated system that converts Microsoft Word documents into an interactive web-based reader backed by JSON content. Originally created for the Animal Health Handbook, it can be adapted to any large document with a chapter/section structure.

🎯 Standalone Chapter Viewer

The chapter-viewer directory is a fully self-contained React application that can be used independently:

✅ Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
✅ Reusable for any book - Just provide your own JSON content
✅ No dependencies on parent project - All book data stored within the viewer directory
✅ Ready to deploy - Complete standalone web application
✅ Easy to customize - Modern React codebase with clear structure

Using the Viewer Standalone

# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer

# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)

# Install and run
pnpm install
pnpm dev

The viewer becomes a universal book reader - perfect for documentation, handbooks, manuals, or any structured content!

See chapter-viewer/README.md for detailed standalone usage instructions.

Distributing Your Book as a Standalone Viewer

After building your book, you can distribute the complete viewer:

# Build your book
make build

# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/

# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer

# Recipients can run it from the copied directory
# (if they extracted the tarball instead, the directory is chapter-viewer/):
cd my-book-viewer
pnpm install
pnpm dev

The chapter-viewer directory contains:

✅ All book content in book_content_json/
✅ All images in book_content_json/chapter_XX/pictures/
✅ Complete React application
✅ Ready to run after pnpm install, with no dependency on the parent project

This makes it perfect for:

📦 Distributing documentation as a web app
🌐 Hosting on GitHub Pages, Netlify, Vercel
💿 Sharing as an offline viewer
📚 Creating multiple book viewers from one codebase

Features

🚀 One-command build - A single make build converts the entire document
📚 Smart chapter detection - Automatically identifies chapters and sections
🖼️ Image extraction - Extracts all images, converting WMF files to PNG
📊 Table processing - Preserves complex table structures
🎨 Format preservation - Maintains bold, italic, fonts, alignment
📦 47% size optimization - Intelligent removal of redundant data
✅ TOC validation - Cross-references table of contents with actual content
🔍 Verification tools - Built-in integrity checking
📱 React web viewer - Responsive mobile-friendly interface

Quick Start

# 1. Install dependencies
make install-deps

# 2. Build the book
make build

# 3. Start the web viewer
make viewer

Open your browser to http://localhost:3000 to view the book.

System Requirements

Required Dependencies

Python 3.8+ with python-docx
ImageMagick 7+ - Image processing
Ghostscript - PDF to PNG conversion
LibreOffice - WMF to PDF conversion
Node.js 16+ with pnpm - Web viewer

Installation

macOS:

brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps

Linux (Ubuntu/Debian):

sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps

macOS LibreOffice Setup:

If LibreOffice was installed via DMG (not Homebrew), run:

make setup-libreoffice

This creates a symlink so ImageMagick can access LibreOffice.

What It Does

Build Pipeline

Word Document
    ↓
1. Extract chapters & sections
2. Parse text with formatting
3. Extract images (WMF → PNG)
4. Process tables
5. Optimize JSON (47% reduction)
6. Build navigation index
    ↓
Interactive Web Viewer

Detailed Steps

Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Health")
Section Splitting - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
TOC Extraction - Extracts and validates Table of Contents
Content Parsing - Preserves formatting, images, tables, footnotes
WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
JSON Optimization - Removes empty arrays, objects, default values
Index Building - Creates navigation structure with statistics

Usage

Build Commands

make build           # Build complete book content
make rebuild-all     # Clean and rebuild from scratch
make clean           # Remove generated files

Development Commands

make dev             # Build and start viewer in one command
make viewer          # Start chapter-viewer dev server
make status          # Show current project status
make stats           # Display content statistics

Verification Commands

make check-deps      # Verify all dependencies installed
make verify          # Check image integrity and content
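
To give a sense of what an image integrity check can involve, here is a minimal sketch that scans every pictures/ directory and flags empty files or files whose leading bytes do not match their extension. It is not the code in verify_images.py; the function names check_image and verify_pictures and the status strings are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# File signatures for the two formats checked in this sketch.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def check_image(path: Path) -> str:
    """Classify one file as 'ok', 'empty', or 'bad signature'."""
    head = path.read_bytes()[:8]
    if not head:
        return "empty"
    suffix = path.suffix.lower()
    if suffix == ".png" and not head.startswith(PNG_MAGIC):
        return "bad signature"
    if suffix in (".jpg", ".jpeg") and not head.startswith(JPEG_MAGIC):
        return "bad signature"
    return "ok"

def verify_pictures(root: Path) -> dict:
    """Map each problematic file under */pictures/ to its status."""
    problems = {}
    for path in sorted(root.rglob("pictures/*")):
        if path.is_file():
            status = check_image(path)
            if status != "ok":
                problems[path.relative_to(root).as_posix()] = status
    return problems

# Demo on a throwaway directory tree (file names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    pictures = Path(tmp) / "chapter_01" / "pictures"
    pictures.mkdir(parents=True)
    (pictures / "good.png").write_bytes(PNG_MAGIC + b"payload")
    (pictures / "bad.png").write_bytes(b"not a png")
    (pictures / "empty.jpg").write_bytes(b"")
    problems = verify_pictures(Path(tmp))

print(problems)
```

Pointing verify_pictures at chapter-viewer/book_content_json/ would apply the same check to a real build; a signature check like this catches truncated or mislabeled files but does not prove an image decodes correctly.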

Project Structure

project-root/
├── build_book.py                    # Main build system (JSON output)
├── split_chapters.py                # Reference: Split to DOCX chapters
├── split_to_md_chapters.py          # Reference: Split to Markdown chapters
├── verify_images.py                 # Image verification tool
├── Makefile                         # Build automation
├── setup_libreoffice.sh             # LibreOffice configuration helper
├── requirements.txt                 # Python dependencies
├── LICENSE                          # GPL-3.0 license
│
├── English HAH Word Apr 6 2024.docx # Source document (not in repo)
│
├── markdown_chapters/               # Markdown export (optional, not in repo)
│   ├── README.md                    # Navigation index
│   └── chapter_XX/                  # Chapter directories
│       ├── section_X_X.md           # Section content
│       └── pictures/                # Extracted images
│
└── chapter-viewer/                  # STANDALONE React web application
    ├── book_content_json/           # Book data (self-contained!)
    │   ├── index.json               # Navigation index
    │   ├── toc_structure.json       # Table of contents
    │   └── chapter_XX/              # Chapter directories
    │       ├── chapter.json         # Chapter metadata
    │       ├── section_XX.json      # Section content
    │       └── pictures/            # Chapter images
    ├── src/                         # React source code
    ├── public/
    │   └── book_content_json/       # Symlink to ../book_content_json/
    ├── package.json
    └── README.md                    # Standalone usage guide

Output Format

JSON Structure

Each section file contains:

{
  "chapter_number": 1,
  "chapter_title": "1.0 HEALTH & DISEASE",
  "content": [
    {
      "type": "paragraph",
      "index": 0,
      "text": "Full paragraph text",
      "runs": [
        {"text": "Bold text", "bold": true, "font_size": 12.0}
      ],
      "alignment": "LEFT (0)"
    },
    {
      "type": "table",
      "rows": 3,
      "cols": 2,
      "cells": [...]
    }
  ],
  "statistics": {
    "paragraphs": 78,
    "tables": 1,
    "images": 10
  }
}
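
As a minimal sketch of how a consumer might read this format, the snippet below walks a section's content list, tallies blocks by type, and collects paragraph text. It relies only on the fields shown above; the helper name summarize_section is hypothetical and not part of build_book.py, and the sample values are illustrative.

```python
import json
from collections import Counter

def summarize_section(section: dict) -> dict:
    """Count content blocks by type and collect paragraph text.

    Uses .get() throughout because the optimizer may have stripped
    empty or default-valued keys from the file.
    """
    blocks = section.get("content", [])
    counts = Counter(block.get("type", "unknown") for block in blocks)
    text = [b.get("text", "") for b in blocks if b.get("type") == "paragraph"]
    return {
        "title": section.get("chapter_title", ""),
        "counts": dict(counts),
        "paragraph_text": text,
    }

# Sample input mirroring the structure above.
sample = json.loads("""
{
  "chapter_number": 1,
  "chapter_title": "1.0 HEALTH & DISEASE",
  "content": [
    {"type": "paragraph", "index": 0, "text": "Full paragraph text",
     "runs": [{"text": "Bold text", "bold": true, "font_size": 12.0}]},
    {"type": "table", "rows": 3, "cols": 2, "cells": []}
  ]
}
""")

summary = summarize_section(sample)
print(summary["counts"])
```

For a real build, the same function could be fed json.load() output from any section file under book_content_json/chapter_XX/.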

Key Features

Smart Chapter Detection

Handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):

Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
Appendix chapters: Start with N.1 section (e.g., "24.1 Infectious Diseases")
29 total chapters fully detected and processed
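
The two heading styles above can be sketched with a pair of regular expressions. This is an illustration under assumptions, not the logic in is_chapter_heading(): the patterns, the chapter_start and detect_chapters helpers, and the rule "an N.1 heading opens a chapter only if chapter N has not been seen yet" are all hypothetical.

```python
import re
from typing import List, Optional, Set, Tuple

# Hypothetical patterns; the real rules live in build_book.py.
CHAPTER_RE = re.compile(r"^(\d+)\.0\s+\S")       # e.g. "1.0 Introduction"
SECTION_RE = re.compile(r"^(\d+)\.(\d+)\s+\S")   # e.g. "24.1 Infectious Diseases"

def chapter_start(text: str, seen: Set[int]) -> Optional[int]:
    """Return the chapter number if `text` opens a new chapter, else None."""
    text = text.strip()
    match = CHAPTER_RE.match(text)
    if match:
        return int(match.group(1))
    match = SECTION_RE.match(text)
    # Appendix style: an N.1 heading whose chapter N has not been opened yet.
    if match and match.group(2) == "1" and int(match.group(1)) not in seen:
        return int(match.group(1))
    return None

def detect_chapters(headings: List[str]) -> List[Tuple[int, str]]:
    """Return (chapter_number, heading) for each heading that opens a chapter."""
    seen: Set[int] = set()
    starts = []
    for heading in headings:
        number = chapter_start(heading, seen)
        if number is not None and number not in seen:
            seen.add(number)
            starts.append((number, heading.strip()))
    return starts

headings = [
    "1.0 Introduction", "1.1 Scope", "1.2 Terms",
    "24.1 Infectious Diseases", "24.2 Parasites",
]
chapters = detect_chapters(headings)
print(chapters)
```

Tracking already-seen chapter numbers is what keeps "1.1 Scope" from being mistaken for an appendix-style chapter start.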

TOC Validation System

Extracts entire Table of Contents (433 entries)
Excludes TOC paragraphs from actual content
Cross-validates TOC against actual content
Generates detailed discrepancy report
Uses actual content titles as source of truth
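
Cross-validation of this kind can be sketched as a comparison of two number-to-title maps, one taken from the TOC and one from the parsed content. The report keys and the validate_toc helper below are assumptions for illustration; the actual layout of toc_validation_report.json may differ.

```python
import re

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", title).strip().lower()

def validate_toc(toc: dict, content: dict) -> dict:
    """Compare {section_number: title} maps from the TOC and from the content."""
    report = {"missing_in_content": [], "missing_in_toc": [], "title_mismatch": []}
    for number, toc_title in toc.items():
        if number not in content:
            report["missing_in_content"].append(number)
        elif normalize(toc_title) != normalize(content[number]):
            # The content title is kept alongside so it can serve as the source of truth.
            report["title_mismatch"].append(
                {"number": number, "toc": toc_title, "content": content[number]}
            )
    report["missing_in_toc"] = [n for n in content if n not in toc]
    return report

# Illustrative data, not taken from the handbook.
toc = {"1.0": "Health", "1.1": "Signs of  Health",
       "1.2": "Disease Causes", "2.0": "Nutrition"}
content = {"1.0": "HEALTH", "1.1": "Signs of Health",
           "1.2": "Causes of Disease", "1.3": "Prevention"}

report = validate_toc(toc, content)
print(report)
```

Normalizing case and whitespace before comparing keeps the report focused on real discrepancies rather than formatting noise.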

JSON Optimization

Achieves 47% file size reduction by removing:

Empty arrays: "images": [], "footnotes": []
Empty objects: "formatting": {}
Empty text runs
Common defaults: "bold": false, "italic": false

Result: 12.8 MB → 6.8 MB (6 MB savings)
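
The reported figures are consistent: (12.8 - 6.8) / 12.8 ≈ 0.469, or about 47%. The removal rules listed above can be sketched as a recursive prune; this is a minimal illustration, assuming that "bold" and "italic" are the only default-false keys and that text runs live under a "runs" key, and the prune helper is hypothetical rather than the function used in build_book.py.

```python
import json

# Keys assumed to default to False, so an explicit False can be dropped.
FALSE_BY_DEFAULT = {"bold", "italic"}

def prune(node):
    """Recursively drop empty arrays/objects, default flags, and empty text runs."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            value = prune(value)
            if key == "runs" and isinstance(value, list):
                # Remove runs that carry no text.
                value = [r for r in value if isinstance(r, dict) and r.get("text")]
            if value == [] or value == {}:
                continue
            if key in FALSE_BY_DEFAULT and value is False:
                continue
            out[key] = value
        return out
    if isinstance(node, list):
        return [prune(item) for item in node]
    return node

block = {
    "type": "paragraph",
    "text": "Hi",
    "runs": [
        {"text": "Hi", "bold": False, "italic": False},
        {"text": "", "bold": True},
    ],
    "images": [],
    "footnotes": [],
    "formatting": {},
}

pruned = prune(block)
before = len(json.dumps(block))
after = len(json.dumps(pruned))
print(pruned, before, after)
```

A reader of the pruned files has to treat a missing key as its default, which is why consumers should look fields up with a fallback value instead of indexing directly.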

WMF Image Conversion

Automatically converts Windows Metafile images using the conversion chain:

WMF → LibreOffice → PDF → Ghostscript → PNG

Handles 35 WMF images (~3% of 1,066 total images).
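
The chain can be sketched as two external commands run in sequence. This sketch calls LibreOffice and Ghostscript directly, which is an assumption: the actual build may drive them through ImageMagick. The binary name libreoffice matches the troubleshooting section (some installs expose soffice instead), and the helper names, the 150 dpi default, and the example path are hypothetical.

```python
import subprocess
from pathlib import Path

def build_wmf_commands(wmf_path: str, out_dir: str, dpi: int = 150):
    """Return the two argv lists for WMF -> PDF -> PNG (nothing is executed)."""
    src = Path(wmf_path)
    out = Path(out_dir)
    pdf = out / (src.stem + ".pdf")
    png = out / (src.stem + ".png")
    # Step 1: LibreOffice renders the metafile to PDF, headless.
    to_pdf = ["libreoffice", "--headless", "--convert-to", "pdf",
              "--outdir", str(out), str(src)]
    # Step 2: Ghostscript rasterizes the PDF to a 24-bit PNG.
    to_png = ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
              f"-r{dpi}", f"-sOutputFile={png}", str(pdf)]
    return to_pdf, to_png

def convert_wmf(wmf_path: str, out_dir: str) -> Path:
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in build_wmf_commands(wmf_path, out_dir):
        subprocess.run(command, check=True)
    return Path(out_dir) / (Path(wmf_path).stem + ".png")

# Building the commands needs neither tool installed.
to_pdf, to_png = build_wmf_commands("pictures/image12.wmf", "pictures")
print(to_pdf)
print(to_png)
```

Separating command construction from execution makes the chain easy to inspect when a conversion fails, since the exact commands can be printed and run by hand.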

Configuration

Edit build_book.py to customize:

INPUT_DOCX = "Your-Document.docx"
JSON_DIR = "book_content_json"
ENABLE_OPTIMIZATION = True          # JSON optimization
ENABLE_TOC_VALIDATION = True        # TOC validation

Build Statistics

Typical results for Animal Health Handbook:

Metric	Value
Chapters	29 (100% detected)
Sections	416
Paragraphs	12,389
Tables	70
Images	1,066 (35 WMF converted)
JSON Size	6.8 MB (47% optimized)
Build Time	~90-100 seconds

Troubleshooting

WMF Images Not Converting

# Check if LibreOffice is accessible
libreoffice --version

# If not found, configure it
make setup-libreoffice

# Rebuild
make rebuild-all

Images Not Loading in Viewer

# Check image integrity
make verify

# If issues found, rebuild
make rebuild-all

Build Fails with Missing Dependencies

# Check what's missing
make check-deps

# Install dependencies
make install-deps

Content Not Updating

# Clean and rebuild
make clean
make build

# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

Advanced Usage

Reference Scripts

The project includes reference scripts that demonstrate the conversion logic:

`split_chapters.py` - Split to DOCX Chapters

Legacy script that splits the book into separate DOCX files per chapter, preserving all formatting, images, and footnotes. Useful for:

Creating individual chapter files
Manual editing of specific chapters
Understanding the document structure

python3 split_chapters.py
# Output: chapters/chapter_01.docx, chapter_02.docx, etc.

`split_to_md_chapters.py` - Split to Markdown Chapters

Reference script that converts chapters to Markdown format, demonstrating the same logic as build_book.py but generating readable .md files instead of JSON. Useful for:

Viewing content in any Markdown viewer
Understanding the conversion logic
Creating documentation or exports
Comparing with chapter-viewer output

python3 split_to_md_chapters.py
# Output: markdown_chapters/chapter_01/*.md with images

Features:

✅ Preserves text formatting (bold, italic, underline)
✅ Converts tables to Markdown table format
✅ Extracts and references images
✅ Maintains chapter/section structure
✅ Creates navigation indexes
✅ Output viewable in any Markdown viewer

The Markdown output closely matches what you see in the chapter-viewer, making it perfect for:

Verifying conversion accuracy
Learning how the system processes documents
Creating alternative export formats
Documentation and archival purposes

Custom Document Processing

To process your own Word document:

Place your .docx file in the project root
Update INPUT_DOCX in build_book.py (or in the reference scripts, if you use them)
Adjust chapter detection patterns if needed (see is_chapter_heading())
Run make rebuild-all

Disabling Optimization

For debugging or compatibility, turn off JSON optimization and, if needed, TOC validation:

# In build_book.py
ENABLE_OPTIMIZATION = False
ENABLE_TOC_VALIDATION = False

Accessing Validation Reports

After build, check:

chapter-viewer/book_content_json/toc_validation_report.json - TOC discrepancies
chapter-viewer/book_content_json/toc_structure.json - Extracted TOC

Documentation

WMF_CONVERSION_GUIDE.md - Image conversion guide
MARKDOWN_EXPORT_GUIDE.md - Markdown export reference guide
chapter-viewer/README.md - Web viewer documentation
CONTRIBUTING.md - Contribution guidelines

Reference Scripts

split_chapters.py - Split book into DOCX chapter files
split_to_md_chapters.py - Convert chapters to Markdown format
build_book.py - Main build system (JSON output)
verify_images.py - Image verification tool

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Run make verify to check integrity
Submit a pull request

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

This means you can:

✅ Use commercially
✅ Modify the code
✅ Distribute
✅ Use privately

Under the conditions:

📋 Disclose source
📋 License and copyright notice
📋 Same license for derivatives
📋 State changes made

See LICENSE file for full details.

Authors

Originally developed to convert the Animal Health Handbook
Authors of the original handbook: Dr. Peter Quesenberry and Dr. Maureen Birmingham

Acknowledgments

python-docx - Word document parsing
ImageMagick - Image processing
LibreOffice - Document conversion
React - Web viewer interface
Vite - Build tooling

Support

For issues, questions, or suggestions:

Check the troubleshooting section above
Review existing issues on GitHub
Create a new issue with:
- System information (OS, Python version, etc.)
- Output of make check-deps
- Error messages or unexpected behavior
- Steps to reproduce

Roadmap

Potential future enhancements:

Support for more document formats (PDF, EPUB input)
Full-text search in viewer
Export to EPUB/PDF from JSON
More aggressive image optimization
Multi-language support
Cloud deployment guides
Docker containerization

Note: This repository does not include the source Word document or generated content. You'll need to provide your own document to process.
