docx parsing was broken in 1.24.7 #3654

@ichenhe

Description of the bug

Cannot read a .docx file with 1.24.7: page.get_text() raises an AssertionError. The same script worked perfectly on v1.24.6.

Logs:

Traceback (most recent call last):
  File "~/d.py", line 15, in <module>
    print(extract_text(local_file_path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/d.py", line 8, in extract_text
    content += page.get_text() + '\n\n'
               ^^^^^^^^^^^^^^^
  File "~/python3.11/site-packages/pymupdf/utils.py", line 798, in get_text
    cb = page.cropbox
         ^^^^^^^^^^^^
  File "~/python3.11/site-packages/pymupdf/__init__.py", line 8535, in cropbox
    page = self._pdf_page()
           ^^^^^^^^^^^^^^^^
  File "~/python3.11/site-packages/pymupdf/__init__.py", line 8051, in _pdf_page
    return _as_pdf_page(self.this)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/python3.11/site-packages/pymupdf/__init__.py", line 337, in _as_pdf_page
    assert ret.m_internal
AssertionError

How to reproduce the bug

TEST.docx

import fitz


def extract_text(file: str) -> str:
    content = ""
    with fitz.open(file) as document:
        for page in document:
            content += page.get_text() + '\n\n'
    content = content.strip()
    return content


if __name__ == '__main__':
    local_file_path = '/path/to/TEST.docx'
    print(extract_text(local_file_path))

PyMuPDF version

1.24.7
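
Since the report turns on the difference between 1.24.6 and 1.24.7, it may help to confirm which PyMuPDF release the failing interpreter actually has installed before re-running the script. Below is a minimal sketch using only the standard library. The helper names (parse_version, is_reported_broken, installed_pymupdf_version) are made up for illustration and are not part of PyMuPDF, and only 1.24.7 is flagged because it is the only release named in this report as broken.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version

# Assumption: only 1.24.7 is flagged, because it is the single release this
# report names as broken. Other releases are not classified here.
REPORTED_BROKEN = (1, 24, 7)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a plain dotted release string such as '1.24.7' into (1, 24, 7).

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.strip().split("."))


def is_reported_broken(text: str) -> bool:
    """Return True if the version equals the release named in this report."""
    return parse_version(text) == REPORTED_BROKEN


def installed_pymupdf_version() -> str | None:
    """Return the installed PyMuPDF version, or None if it is not installed."""
    try:
        return version("PyMuPDF")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pymupdf_version()
    if found is None:
        print("PyMuPDF is not installed in this environment")
    else:
        print(found, "matches the reported release:", is_reported_broken(found))
```

Running this in the same environment as the reproduction script shows whether the interpreter sees the release under discussion or a different one.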

Operating system

macOS

Python version

3.11
