Endless Loop When Processing Certain Large PDF with PdfFileWriter #358

@suokunlong

STEPS TO REPRODUCE:

  1. Download the test PDF file:
    https://suokunlong.cn/owncloud/index.php/s/bWyTHYfoMii3Yh9
    The file is named 2017-Textbook-EconomicLaw.pdf; it is 65.1 MB and has 511 pages.

  2. Run the following code:

from PyPDF2 import PdfFileWriter, PdfFileReader

pdf_in_filename = r"/path/to/2017-Textbook-EconomicLaw.pdf"
pdf_out_filename = r"/path/to/2017-Textbook-EconomicLaw-new.pdf"

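pdf_out = PdfFileWriter()
# The input file has to stay open until write() has finished.
pdf_in = PdfFileReader(open(pdf_in_filename, 'rb'))

# Copy every page of the input into the writer.
numpages = pdf_in.getNumPages()
for i in range(numpages):
    pdf_out.addPage(pdf_in.getPage(i))

# Write the result; this is the call that never returns.
with open(pdf_out_filename, 'wb') as outputStream:
    pdf_out.write(outputStream)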

  3. The code never finishes; it hangs on the last line, pdf_out.write(outputStream).
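
To see where the write call is spending its time, one option is to have Python dump the stack periodically while the call runs. The following is only a diagnostic sketch using the standard-library faulthandler module (Python 3.3+); the helper name call_with_stack_dumps and its parameters are made up for illustration and are not part of PyPDF2.

```python
import faulthandler
import sys


def call_with_stack_dumps(func, interval_s=60.0, log=None):
    """Call func() and return its result.

    While func() is running, the stacks of all threads are written to `log`
    every `interval_s` seconds, so a call that never returns shows where it
    is stuck. `log` must be a real file object (it needs a fileno()).
    """
    log = sys.stderr if log is None else log
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log)
    try:
        return func()
    finally:
        # Always stop the watchdog, even if func() raises.
        faulthandler.cancel_dump_traceback_later()
```

With the script above, the last line could be wrapped as call_with_stack_dumps(lambda: pdf_out.write(outputStream)). If the same PyPDF2 frames show up in every dump, that would suggest a loop in that function rather than slow but steady progress.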

OTHER USEFUL INFORMATION:

  1. I noticed that if I revise the line:
for i in range(numpages):

to:

for i in range(3):

then I get the output very quickly.

  2. I also noticed that if I open the test PDF file with evince on my Linux desktop and print it to a new PDF file, then the above code, run on that new file, finishes within 5 s.
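
Since a small page count finishes quickly and the full count does not, a binary search over the page count may help find where the slowdown starts. The sketch below is not a confirmed diagnosis. It assumes that once a prefix of pages is slow, every longer prefix is slow too, and the names COPY_SNIPPET, finishes_in_time and smallest_slow_count, as well as the 30 s timeout, are hypothetical. Each attempt runs in a child process so that a hung write can be killed.

```python
import subprocess
import sys

# Script run in a child process: copies the first n pages using the
# PyPDF2 1.x API from the report above.
COPY_SNIPPET = """
import sys
from PyPDF2 import PdfFileWriter, PdfFileReader
src, dst, n = sys.argv[1], sys.argv[2], int(sys.argv[3])
with open(src, 'rb') as f:
    reader = PdfFileReader(f)
    writer = PdfFileWriter()
    for i in range(n):
        writer.addPage(reader.getPage(i))
    with open(dst, 'wb') as out:
        writer.write(out)
"""


def finishes_in_time(src, dst, n, timeout_s=30.0):
    """Return True if copying the first n pages completes within timeout_s.

    The child is killed on timeout. A crash in the child raises
    CalledProcessError instead of being counted as "slow".
    """
    try:
        subprocess.run(
            [sys.executable, "-c", COPY_SNIPPET, src, dst, str(n)],
            timeout=timeout_s,
            check=True,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


def smallest_slow_count(total_pages, is_fast):
    """Return the smallest n in 1..total_pages for which is_fast(n) is False.

    Assumes is_fast is monotone (fast for short prefixes, slow from some
    point on). Returns None if even the full page count is fast.
    """
    if is_fast(total_pages):
        return None
    lo, hi = 1, total_pages  # invariant: is_fast(hi) is False
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fast(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo
```

For the file in this report, a call could look like smallest_slow_count(511, lambda n: finishes_in_time(pdf_in_filename, pdf_out_filename, n)). If the run time grows gradually with the page count instead of jumping at one page, the result depends on the chosen timeout and should be read with that in mind.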

PyPDF2.__version__
'1.25.1'

Labels

PdfWriter: The PdfWriter component is affected
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF