🚀 Automated system to convert Microsoft Word documents into interactive web-based readers. Extracts chapters, sections, images, and tables into optimized JSON with a React viewer. Perfect for large documents like handbooks and manuals.

larsgson/docx2app

Document Conversion System

License: GPL v3 | Python 3.8+ | Node.js

A comprehensive automated system to convert Microsoft Word documents into an interactive web-based reader with JSON content backend. Originally created for the Animal Health Handbook, this system can be adapted for any large document with chapter/section structure.

🎯 Standalone Chapter Viewer

The chapter-viewer directory is a fully self-contained React application that can be used independently:

  • ✅ Extract and use separately - Copy the chapter-viewer folder to create your own book viewer
  • ✅ Reusable for any book - Just provide your own JSON content
  • ✅ No dependencies on parent project - All book data stored within the viewer directory
  • ✅ Ready to deploy - Complete standalone web application
  • ✅ Easy to customize - Modern React codebase with clear structure

Using the Viewer Standalone

# Copy the viewer to create your own project
cp -r chapter-viewer my-book-viewer
cd my-book-viewer

# Add your book content to book_content_json/
# (Follow the JSON format described in chapter-viewer/README.md)

# Install and run
pnpm install
pnpm dev

The viewer becomes a universal book reader - perfect for documentation, handbooks, manuals, or any structured content!

See chapter-viewer/README.md for detailed standalone usage instructions.

Distributing Your Book as a Standalone Viewer

After building your book, you can distribute the complete viewer:

# Build your book
make build

# The chapter-viewer directory is now self-contained!
# Package it for distribution:
tar -czf my-book-viewer.tar.gz chapter-viewer/

# Or just copy it anywhere:
cp -r chapter-viewer /path/to/my-book-viewer

# Recipients can use it immediately:
cd my-book-viewer
pnpm install
pnpm dev

The chapter-viewer directory contains:

  • ✅ All book content in book_content_json/
  • ✅ All images in book_content_json/chapter_XX/pictures/
  • ✅ Complete React application
  • ✅ Ready to run after pnpm install, with no dependency on the parent project

This makes it perfect for:

  • 📦 Distributing documentation as a web app
  • 🌐 Hosting on GitHub Pages, Netlify, Vercel
  • 💿 Sharing as an offline viewer
  • 📚 Creating multiple book viewers from one codebase

Features

  • 🚀 One-command build - A single make build converts the entire document
  • 📚 Smart chapter detection - Automatically identifies chapters and sections
  • 🖼️ Image extraction - Extracts all images, converting WMF files to PNG
  • 📊 Table processing - Preserves complex table structures
  • 🎨 Format preservation - Maintains bold, italic, fonts, alignment
  • 📦 47% size optimization - Removes redundant data such as empty arrays and default values
  • ✅ TOC validation - Cross-references table of contents with actual content
  • 🔍 Verification tools - Built-in integrity checking
  • 📱 React web viewer - Responsive mobile-friendly interface

Quick Start

# 1. Install dependencies
make install-deps

# 2. Build the book
make build

# 3. Start the web viewer
make viewer

Open your browser to http://localhost:3000 to view the book.

System Requirements

Required Dependencies

  • Python 3.8+ with python-docx
  • ImageMagick 7+ - Image processing
  • Ghostscript - PDF to PNG conversion
  • LibreOffice - WMF to PDF conversion
  • Node.js 16+ - Web viewer

Installation

macOS:

brew install imagemagick ghostscript
brew install --cask libreoffice
make install-deps

Linux (Ubuntu/Debian):

sudo apt-get install imagemagick ghostscript libreoffice python3-pip nodejs npm
make install-deps

macOS LibreOffice Setup:

If LibreOffice was installed via DMG (not Homebrew), run:

make setup-libreoffice

This creates a symlink so ImageMagick can access LibreOffice.

What It Does

Build Pipeline

Word Document
    ↓
1. Extract chapters & sections
2. Parse text with formatting
3. Extract images (WMF → PNG)
4. Process tables
5. Optimize JSON (47% reduction)
6. Build navigation index
    ↓
Interactive Web Viewer

Detailed Steps

  1. Chapter Detection - Identifies chapters by N.0 headings (e.g., "1.0 Health")
  2. Section Splitting - Subdivides chapters into N.X sections (e.g., "1.1", "1.2")
  3. TOC Extraction - Extracts and validates Table of Contents
  4. Content Parsing - Preserves formatting, images, tables, footnotes
  5. WMF Conversion - Converts Windows Metafiles to PNG via LibreOffice → PDF → PNG
  6. JSON Optimization - Removes empty arrays, objects, default values
  7. Index Building - Creates navigation structure with statistics
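
Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the code in build_book.py: the split_sections helper, the SECTION_RE pattern, and the plain-string input are assumptions made for the example, and a pattern this simple would also match body text that happens to start with a number such as "1.5 kg".

```python
import re
from collections import OrderedDict

# Hypothetical pattern: a heading starts with "N.X" followed by a title,
# e.g. "1.0 Health" or "1.1 Signs of Disease".
SECTION_RE = re.compile(r"^(\d+)\.(\d+)\s+\S")


def split_sections(paragraphs):
    """Group paragraph strings under the most recent N.X heading."""
    sections = OrderedDict()
    current = None
    for text in paragraphs:
        match = SECTION_RE.match(text.strip())
        if match:
            current = (int(match.group(1)), int(match.group(2)))
            sections[current] = {"title": text.strip(), "paragraphs": []}
        elif current is not None:
            sections[current]["paragraphs"].append(text)
        # Text before the first heading (front matter) is ignored here.
    return sections


# Made-up sample paragraphs, for illustration only.
sample = [
    "Preface text",
    "1.0 Health",
    "Intro paragraph",
    "1.1 Signs of Disease",
    "Body paragraph",
]
for (chapter, section), data in split_sections(sample).items():
    print(chapter, section, data["title"], len(data["paragraphs"]))
```

Keying each section by a (chapter, section) tuple keeps the groups in document order and makes it straightforward to write one file per section afterwards.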

Usage

Build Commands

make build           # Build complete book content
make rebuild-all     # Clean and rebuild from scratch
make clean           # Remove generated files

Development Commands

make dev             # Build and start viewer in one command
make viewer          # Start chapter-viewer dev server
make status          # Show current project status
make stats           # Display content statistics

Verification Commands

make check-deps      # Verify all dependencies installed
make verify          # Check image integrity and content

Project Structure

project-root/
├── build_book.py                    # Main build system (JSON output)
├── split_chapters.py                # Reference: Split to DOCX chapters
├── split_to_md_chapters.py          # Reference: Split to Markdown chapters
├── verify_images.py                 # Image verification tool
├── Makefile                         # Build automation
├── setup_libreoffice.sh             # LibreOffice configuration helper
├── requirements.txt                 # Python dependencies
├── LICENSE                          # GPL-3.0 license
│
├── English HAH Word Apr 6 2024.docx # Source document (not in repo)
│
├── markdown_chapters/               # Markdown export (optional, not in repo)
│   ├── README.md                    # Navigation index
│   └── chapter_XX/                  # Chapter directories
│       ├── section_X_X.md           # Section content
│       └── pictures/                # Extracted images
│
└── chapter-viewer/                  # STANDALONE React web application
    ├── book_content_json/           # Book data (self-contained!)
    │   ├── index.json               # Navigation index
    │   ├── toc_structure.json       # Table of contents
    │   └── chapter_XX/              # Chapter directories
    │       ├── chapter.json         # Chapter metadata
    │       ├── section_XX.json      # Section content
    │       └── pictures/            # Chapter images
    ├── src/                         # React source code
    ├── public/
    │   └── book_content_json/       # Symlink to ../book_content_json/
    ├── package.json
    └── README.md                    # Standalone usage guide

Output Format

JSON Structure

Each section file contains:

{
  "chapter_number": 1,
  "chapter_title": "1.0 HEALTH & DISEASE",
  "content": [
    {
      "type": "paragraph",
      "index": 0,
      "text": "Full paragraph text",
      "runs": [
        {"text": "Bold text", "bold": true, "font_size": 12.0}
      ],
      "alignment": "LEFT (0)"
    },
    {
      "type": "table",
      "rows": 3,
      "cols": 2,
      "cells": [...]
    }
  ],
  "statistics": {
    "paragraphs": 78,
    "tables": 1,
    "images": 10
  }
}
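
A section file in this shape can be read with nothing more than the standard json module. The sketch below parses an inline sample that follows the structure above and compares the item counts with the statistics block. The count_content_types helper is hypothetical, the elided cells list is left empty, the sample statistics are trimmed to match the two sample items, and the idea that statistics should equal the number of content items is an assumption. Keys are read with .get() because the optimization step may have removed empty or default fields.

```python
import json
from collections import Counter

# Inline sample following the structure shown above; a real file would be
# loaded from chapter-viewer/book_content_json/chapter_XX/ instead.
SECTION_JSON = """
{
  "chapter_number": 1,
  "chapter_title": "1.0 HEALTH & DISEASE",
  "content": [
    {"type": "paragraph", "index": 0, "text": "Full paragraph text",
     "runs": [{"text": "Bold text", "bold": true, "font_size": 12.0}],
     "alignment": "LEFT (0)"},
    {"type": "table", "rows": 3, "cols": 2, "cells": []}
  ],
  "statistics": {"paragraphs": 1, "tables": 1, "images": 0}
}
"""


def count_content_types(section):
    """Count content items by their "type" field."""
    return Counter(
        item.get("type", "unknown") for item in section.get("content", [])
    )


section = json.loads(SECTION_JSON)
counts = count_content_types(section)
stats = section.get("statistics", {})

print(section["chapter_title"])
for kind, stat_key in (("paragraph", "paragraphs"), ("table", "tables")):
    recorded = stats.get(stat_key, 0)
    status = "ok" if counts[kind] == recorded else "mismatch"
    print(f"{kind}: {counts[kind]} found, {recorded} recorded ({status})")
```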

Key Features

Smart Chapter Detection

Handles both standard chapters (N.0 format) and appendix-style chapters (starting with N.1):

  • Regular chapters: Start with N.0 heading (e.g., "1.0 Introduction")
  • Appendix chapters: Start with N.1 section (e.g., "24.1 Infectious Diseases")
  • All 29 chapters of the Animal Health Handbook are detected and processed
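
One way to express the two chapter styles is to treat an N.1 heading as a chapter start only when no heading for chapter N has been seen yet. The sketch below illustrates that rule. It is not the project's is_chapter_heading(): the classify_heading function and HEADING_RE pattern are hypothetical, and "24.2 Parasites" is a made-up heading used only as sample input.

```python
import re

# Hypothetical pattern for "N.X Title" headings.
HEADING_RE = re.compile(r"^(\d+)\.(\d+)\s+(.+)$")


def classify_heading(text, seen_chapters):
    """Return ("chapter" | "section" | None, chapter_number).

    seen_chapters is a set of chapter numbers already encountered; it is
    updated in place so the caller can feed headings in document order.
    """
    match = HEADING_RE.match(text.strip())
    if not match:
        return None, None
    chapter, section = int(match.group(1)), int(match.group(2))
    if section == 0:
        kind = "chapter"      # regular chapter: "N.0 Title"
    elif section == 1 and chapter not in seen_chapters:
        kind = "chapter"      # appendix-style chapter: opens at "N.1"
    else:
        kind = "section"
    seen_chapters.add(chapter)
    return kind, chapter


seen = set()
headings = [
    "1.0 Introduction",
    "1.1 Overview",
    "24.1 Infectious Diseases",
    "24.2 Parasites",
]
for heading in headings:
    print(heading, "->", classify_heading(heading, seen)[0])
```

Tracking seen chapter numbers is what keeps "1.1 Overview" a section while "24.1 Infectious Diseases" opens a new chapter.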

TOC Validation System

  • Extracts entire Table of Contents (433 entries)
  • Excludes TOC paragraphs from actual content
  • Cross-validates TOC against actual content
  • Generates detailed discrepancy report
  • Uses actual content titles as source of truth
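
The cross-validation idea can be illustrated with two dictionaries that map section numbers to titles. The sketch below is an assumption-laden example: the validate_toc and normalize helpers, the report keys, and the sample titles are all hypothetical, and the real layout of toc_validation_report.json may differ.

```python
import re


def normalize(title):
    """Lower-case a title and collapse runs of whitespace for comparison."""
    return re.sub(r"\s+", " ", title).strip().lower()


def validate_toc(toc_entries, content_titles):
    """Compare TOC titles with content titles, keyed by section number.

    Both arguments map a section number such as "1.1" to its title.
    """
    report = {"missing_in_content": [], "missing_in_toc": [], "title_mismatch": []}
    for number, toc_title in toc_entries.items():
        if number not in content_titles:
            report["missing_in_content"].append(number)
        elif normalize(toc_title) != normalize(content_titles[number]):
            # The content title is recorded as the one to trust.
            report["title_mismatch"].append(
                {"number": number, "toc": toc_title, "content": content_titles[number]}
            )
    report["missing_in_toc"] = [n for n in content_titles if n not in toc_entries]
    return report


# Made-up sample data, for illustration only.
toc = {"1.0": "Health & Disease", "1.1": "Signs of Health", "1.2": "Removed Section"}
content = {"1.0": "HEALTH &  DISEASE", "1.1": "Signs of Good Health", "1.3": "New Section"}
report = validate_toc(toc, content)
for key, value in report.items():
    print(key, value)
```

Normalizing case and whitespace before comparing avoids flagging entries that differ only in formatting, so the report lists genuine discrepancies.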

JSON Optimization

Achieves 47% file size reduction by removing:

  • Empty arrays: "images": [], "footnotes": []
  • Empty objects: "formatting": {}
  • Empty text runs
  • Common defaults: "bold": false, "italic": false

Result: 12.8 MB → 6.8 MB (6 MB savings)
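
The removal rules listed above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the optimizer in build_book.py: the optimize function and DEFAULTS table are hypothetical, and the sketch deliberately leaves list positions untouched (apart from empty text runs) so that table cell indices stay stable.

```python
import json

# Defaults named in the list above; treating exactly these keys as
# droppable is the assumption this sketch makes.
DEFAULTS = {"bold": False, "italic": False}


def optimize(node):
    """Recursively drop empty containers, default flags, and empty runs."""
    if isinstance(node, list):
        return [optimize(item) for item in node]
    if not isinstance(node, dict):
        return node
    cleaned = {}
    for key, value in node.items():
        value = optimize(value)
        if key == "runs" and isinstance(value, list):
            # Drop runs that carry no text.
            value = [r for r in value if isinstance(r, dict) and r.get("text")]
        if isinstance(value, (list, dict)) and not value:
            continue  # empty array or object
        if key in DEFAULTS and value is DEFAULTS[key]:
            continue  # common default value
        cleaned[key] = value
    return cleaned


paragraph = {
    "type": "paragraph",
    "index": 0,
    "text": "Bold text",
    "runs": [
        {"text": "Bold text", "bold": True, "italic": False},
        {"text": "", "bold": False},
    ],
    "images": [],
    "footnotes": [],
    "formatting": {},
}
optimized = optimize(paragraph)
print(optimized)
print(len(json.dumps(paragraph)), "->", len(json.dumps(optimized)), "characters")
```

A reader of the optimized files then has to treat a missing key as its default, for example a run without "bold" is not bold.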

WMF Image Conversion

Automatically converts Windows Metafile images using the conversion chain:

WMF → LibreOffice → PDF → Ghostscript → PNG

Handles 35 WMF images (~3% of 1,066 total images).
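
For orientation, the chain can be written out as two command lines, one per tool. The sketch below only builds the argument lists and does not run anything. It assumes the tools are invoked directly as libreoffice and gs, whereas the project may route the conversion through ImageMagick. The wmf_to_png_commands helper, the 150 dpi resolution, and the file names are illustrative choices.

```python
from pathlib import Path


def wmf_to_png_commands(wmf_path, work_dir, dpi=150):
    """Build the two commands for WMF -> PDF -> PNG (nothing is executed)."""
    wmf = Path(wmf_path)
    work = Path(work_dir)
    pdf = work / (wmf.stem + ".pdf")
    png = work / (wmf.stem + ".png")
    # Step 1: LibreOffice in headless mode writes <stem>.pdf into work_dir.
    to_pdf = [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        "--outdir", str(work),
        str(wmf),
    ]
    # Step 2: Ghostscript rasterizes the PDF to a 24-bit PNG.
    to_png = [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={png}",
        str(pdf),
    ]
    return [to_pdf, to_png]


commands = wmf_to_png_commands("pictures/diagram.wmf", "tmp_convert")
for command in commands:
    print(" ".join(command))
    # To execute: subprocess.run(command, check=True)
```

On some installations the LibreOffice binary is called soffice rather than libreoffice, which is the kind of mismatch make setup-libreoffice is meant to resolve.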

Configuration

Edit build_book.py to customize:

INPUT_DOCX = "Your-Document.docx"
JSON_DIR = "book_content_json"
ENABLE_OPTIMIZATION = True          # JSON optimization
ENABLE_TOC_VALIDATION = True        # TOC validation

Build Statistics

Typical results for the Animal Health Handbook:

Metric        Value
Chapters      29 (100% detected)
Sections      416
Paragraphs    12,389
Tables        70
Images        1,066 (35 WMF converted)
JSON Size     6.8 MB (47% optimized)
Build Time    ~90-100 seconds
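
The percentages quoted in this README follow from the raw figures, as the short check below shows. The percent helper is just a convenience for this example.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


size_before_mb, size_after_mb = 12.8, 6.8
wmf_images, total_images = 35, 1066

reduction = percent(size_before_mb - size_after_mb, size_before_mb)
wmf_share = percent(wmf_images, total_images)

print(f"JSON size reduction: {reduction:.1f}%")   # 46.9%, quoted as 47%
print(f"WMF share of images: {wmf_share:.1f}%")   # 3.3%, quoted as ~3%
```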

Troubleshooting

WMF Images Not Converting

# Check if LibreOffice is accessible
libreoffice --version

# If not found, configure it
make setup-libreoffice

# Rebuild
make rebuild-all

Images Not Loading in Viewer

# Check image integrity
make verify

# If issues found, rebuild
make rebuild-all
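
For a quick manual check outside the Makefile, a few lines of Python can flag PNG files that are empty or truncated at the header. This sketch is not what verify_images.py does. The find_bad_pngs helper is hypothetical, it only looks at .png files under chapter_*/pictures/, and it only tests the 8-byte PNG signature. The demo runs against a temporary directory, so it does not touch real book content.

```python
import tempfile
from pathlib import Path

# Every valid PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_bad_pngs(json_root):
    """Return .png files under chapter_*/pictures/ lacking the PNG signature."""
    bad = []
    for png in sorted(Path(json_root).glob("chapter_*/pictures/*.png")):
        with png.open("rb") as handle:
            header = handle.read(len(PNG_SIGNATURE))
        if header != PNG_SIGNATURE:
            bad.append(png)
    return bad


# Demo on a throwaway directory that mimics the book_content_json/ layout.
with tempfile.TemporaryDirectory() as tmp:
    pictures = Path(tmp) / "chapter_01" / "pictures"
    pictures.mkdir(parents=True)
    (pictures / "good.png").write_bytes(PNG_SIGNATURE + b"rest-of-file")
    (pictures / "bad.png").write_bytes(b"")
    bad_names = [path.name for path in find_bad_pngs(tmp)]

print(bad_names)
```

To scan a real build, pass the path of chapter-viewer/book_content_json to find_bad_pngs.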

Build Fails with Missing Dependencies

# Check what's missing
make check-deps

# Install dependencies
make install-deps

Content Not Updating

# Clean and rebuild
make clean
make build

# Force browser refresh
# Chrome/Firefox: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux)

Advanced Usage

Reference Scripts

The project includes reference scripts that demonstrate the conversion logic:

split_chapters.py - Split to DOCX Chapters

Legacy script that splits the book into separate DOCX files per chapter, preserving all formatting, images, and footnotes. Useful for:

  • Creating individual chapter files
  • Manual editing of specific chapters
  • Understanding the document structure

python3 split_chapters.py
# Output: chapters/chapter_01.docx, chapter_02.docx, etc.

split_to_md_chapters.py - Split to Markdown Chapters

Reference script that converts chapters to Markdown format, demonstrating the same logic as build_book.py but generating readable .md files instead of JSON. Useful for:

  • Viewing content in any Markdown viewer
  • Understanding the conversion logic
  • Creating documentation or exports
  • Comparing with chapter-viewer output

python3 split_to_md_chapters.py
# Output: markdown_chapters/chapter_01/*.md with images

Features:

  • ✅ Preserves text formatting (bold, italic, underline)
  • ✅ Converts tables to Markdown table format
  • ✅ Extracts and references images
  • ✅ Maintains chapter/section structure
  • ✅ Creates navigation indexes
  • ✅ Output viewable in any Markdown viewer
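
As a rough illustration of how formatted runs can map to Markdown, the sketch below wraps bold and italic runs in emphasis markers. It is not taken from split_to_md_chapters.py: the function names are hypothetical, the run keys follow the JSON example earlier in this README, the sample paragraph is made up, and underline is left out because Markdown has no native syntax for it.

```python
def run_to_markdown(run):
    """Wrap a run's text in Markdown emphasis markers based on its flags."""
    text = run.get("text", "")
    stripped = text.strip()
    if not stripped:
        return text
    if run.get("bold") and run.get("italic"):
        marker = "***"
    elif run.get("bold"):
        marker = "**"
    elif run.get("italic"):
        marker = "*"
    else:
        return text
    # Keep surrounding whitespace outside the markers so emphasis renders.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{stripped}{marker}{trail}"


def paragraph_to_markdown(paragraph):
    """Join a paragraph's runs, falling back to plain text if none exist."""
    runs = paragraph.get("runs")
    if not runs:
        return paragraph.get("text", "")
    return "".join(run_to_markdown(run) for run in runs)


# Made-up sample paragraph, for illustration only.
paragraph = {
    "type": "paragraph",
    "text": "Warning: isolate sick animals",
    "runs": [
        {"text": "Warning: ", "bold": True},
        {"text": "isolate sick animals"},
    ],
}
markdown_line = paragraph_to_markdown(paragraph)
print(markdown_line)
```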

The Markdown output closely matches what you see in the chapter-viewer, making it perfect for:

  • Verifying conversion accuracy
  • Learning how the system processes documents
  • Creating alternative export formats
  • Documentation and archival purposes

Custom Document Processing

To process your own Word document:

  1. Place your .docx file in the project root
  2. Update INPUT_DOCX in build_book.py (or reference scripts)
  3. Adjust chapter detection patterns if needed (see is_chapter_heading())
  4. Run make rebuild-all

Disabling Optimization

For debugging or compatibility:

# In build_book.py
ENABLE_OPTIMIZATION = False
ENABLE_TOC_VALIDATION = False

Accessing Validation Reports

After build, check:

  • chapter-viewer/book_content_json/toc_validation_report.json - TOC discrepancies
  • chapter-viewer/book_content_json/toc_structure.json - Extracted TOC

Documentation

Reference Scripts

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run make verify to check integrity
  5. Submit a pull request

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

This means you can:

  • ✅ Use commercially
  • ✅ Modify the code
  • ✅ Distribute
  • ✅ Use privately

Under these conditions:

  • 📋 Disclose source
  • 📋 License and copyright notice
  • 📋 Same license for derivatives
  • 📋 State changes made

See LICENSE file for full details.

Authors

  • Original development for Animal Health Handbook document conversion
  • Authors: Dr. Peter Quesenberry and Dr. Maureen Birmingham (original handbook)

Acknowledgments

  • python-docx - Word document parsing
  • ImageMagick - Image processing
  • LibreOffice - Document conversion
  • React - Web viewer interface
  • Vite - Build tooling

Support

For issues, questions, or suggestions:

  1. Check the troubleshooting section above
  2. Review existing issues on GitHub
  3. Create a new issue with:
    • System information (OS, Python version, etc.)
    • Output of make check-deps
    • Error messages or unexpected behavior
    • Steps to reproduce

Roadmap

Potential future enhancements:

  • Support for more document formats (PDF, EPUB input)
  • Full-text search in viewer
  • Export to EPUB/PDF from JSON
  • More aggressive image optimization
  • Multi-language support
  • Cloud deployment guides
  • Docker containerization

Note: This repository does not include the source Word document or generated content. You'll need to provide your own document to process.
