When can we trust a /Pages dictionary to contain Page references directly? #18637

richard-smith-preservica · 2024-08-21T14:59:32Z

richard-smith-preservica
Aug 21, 2024

Related: my patch here #18627 and discussion on this issue #14570

Use case: render-on-demand (i.e. disableStream and disableAutoFetch are set, and the server supports range requests) for a large document, such as a PDF where each page is an image, where we want to show something to the user quickly. Such documents are a common output of digitisation.

When the PDF has (i) all the /Page dicts referenced from the top-level /Pages dict and (ii) /Page dicts interleaved with content, loading the last page causes an XHR to be sent for every prior page (see my comment on #14570 for more detail).
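To make the cost concrete, here is a toy model of that behaviour. It is a sketch in Python, not pdf.js code: `FakeXref` and `find_page` are hypothetical names, and each `fetch()` call stands in for one ranged request. It assumes the walker has to load every kid just to learn whether it is a /Page or a nested /Pages node.

```python
from dataclasses import dataclass, field


@dataclass
class FakeXref:
    """Stand-in for ranged loading: every fetch() is one simulated XHR."""
    objects: dict
    fetched: list = field(default_factory=list)

    def fetch(self, ref):
        self.fetched.append(ref)
        return self.objects[ref]


def find_page(xref, root_ref, index):
    """Walk /Kids in order, fetching each kid to read its /Type."""
    root = xref.fetch(root_ref)
    seen = 0
    for kid_ref in root["Kids"]:
        kid = xref.fetch(kid_ref)  # one request per kid, even if we skip it
        if kid["Type"] == "Page":
            if seen == index:
                return kid_ref
            seen += 1
        # Nested /Pages nodes are left out to keep the model small.
    raise IndexError(index)


# Flat tree: object 1 is /Pages, objects 2..31 are thirty /Page dicts.
objects = {1: {"Type": "Pages", "Count": 30, "Kids": list(range(2, 32))}}
objects.update({ref: {"Type": "Page"} for ref in range(2, 32)})

xref = FakeXref(objects)
last_page_ref = find_page(xref, 1, 29)
requests_made = len(xref.fetched)  # root + all 30 kids
```

In this model, asking for the last of 30 pages touches the root plus all 30 page objects, which is the pattern described above.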

If we could trust that the /Kids of the top-level /Pages dict contained exactly the refs to /Page dicts and nothing else, we could avoid loading all the previous pages, and fetching a single page would be quick. The team rightly don't want to be too trusting of that, since lots of PDFs get created wrongly (see all the justifications on tickets mentioned here #14570 (comment)). The question in this discussion is: are there circumstances where we can trust that?

This change master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely contains the core of the idea: if the page count and the number of children of the top-level /Pages dict match, then assume the mapping is 1:1 and fetch pages independently.
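The core of that idea can be sketched as follows. This is an illustrative Python model, not the code in the linked branch: the page tree root is a plain dict, and `kids_look_one_to_one` and `page_ref_for_index` are hypothetical names.

```python
def kids_look_one_to_one(pages_dict):
    """True when /Count equals the number of entries in /Kids."""
    kids = pages_dict.get("Kids")
    count = pages_dict.get("Count")
    return isinstance(kids, list) and isinstance(count, int) and count == len(kids)


def page_ref_for_index(pages_dict, index):
    """Return Kids[index] directly if the heuristic holds, else None.

    None means "no shortcut": the caller falls back to the full tree walk.
    """
    if kids_look_one_to_one(pages_dict) and 0 <= index < len(pages_dict["Kids"]):
        return pages_dict["Kids"][index]
    return None


flat_root = {"Type": "Pages", "Count": 3, "Kids": [2, 3, 4]}
nested_root = {"Type": "Pages", "Count": 5, "Kids": [2, 3]}

direct = page_ref_for_index(flat_root, 2)      # shortcut applies
fallback = page_ref_for_index(nested_root, 1)  # counts differ, no shortcut
```

The shortcut only ever answers when the two numbers agree. Whether agreement is enough evidence of a flat tree is exactly the open question here.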

Possible criteria for 'this is ok'

- There are more than N pages (the straw man uses 20). A 'bad' PDF with 30 pages and 30 /Kids entries that isn't 1:1 is less likely than one where that number is 4, and the benefit is larger for longer documents.
- The PDF is linearised. Tools that generate linearised PDFs might be more trustworthy.
- An explicit configuration option (like disableAutoFetch), so you can choose based on context (e.g. we could set it for PDFs that have passed validation; other users might set it if they generated the PDF themselves and know it's ok).
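One way these criteria could combine, as a hedged Python sketch. `TrustPolicy`, `may_skip_walk` and the `assume_flat_page_tree` flag are all hypothetical names for illustration; none of them is an existing pdf.js option.

```python
from dataclasses import dataclass


@dataclass
class TrustPolicy:
    min_pages: int = 20                  # straw-man threshold from the patch
    trust_linearized: bool = True        # treat linearised files as more trustworthy
    assume_flat_page_tree: bool = False  # hypothetical explicit opt-in


def may_skip_walk(count, n_kids, is_linearized, policy):
    """Decide whether to index /Kids directly instead of walking the tree."""
    if count != n_kids:
        return False  # precondition for every criterion
    if policy.assume_flat_page_tree:
        return True   # caller vouches for the file (e.g. it passed validation)
    if policy.trust_linearized and is_linearized:
        return True
    return count > policy.min_pages


default_policy = TrustPolicy()
long_doc = may_skip_walk(30, 30, False, default_policy)
short_doc = may_skip_walk(4, 4, False, default_policy)
opted_in = may_skip_walk(4, 4, False, TrustPolicy(assume_flat_page_tree=True))
```

Making the criteria independent switches keeps the risky behaviour off by default for short documents, while letting a deployment that has validated its PDFs opt in explicitly.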

... but really I'm opening this to see if anyone has more robust suggestions for ways we can avoid fetching every /Page dictionary in this case.

calixteman · 2025-10-13T16:15:41Z

calixteman
Oct 13, 2025
Maintainer

Are you able to modify the PDF structure?
If so, maybe having something like:

Pages (1R) {
   Count: 123,
   Type: /Pages,
   Kids: [ 
      { Count: 1, Kids: [ 2R ], Type: /Pages, Parent: 1R },
      { Count: 1, Kids: [ 3R ], Type: /Pages, Parent: 1R },
      ....
   ]
}

could help to quickly get the last page without having to fetch all the other pages.
Wdyt?
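A toy model of why that layout helps, reading the structure above as having the wrapper nodes written inline in /Kids (that reading is an assumption). It is Python rather than pdf.js code, `find_page_with_counts` is a hypothetical name, and integers stand in for indirect references that cost one fetch each.

```python
def find_page_with_counts(fetch, root_ref, index):
    """Descend the page tree, using /Count to skip whole subtrees."""
    node = fetch(root_ref)
    while True:
        for kid in node["Kids"]:
            # Inline dicts are already in memory; references cost a fetch.
            entry = kid if isinstance(kid, dict) else fetch(kid)
            if entry["Type"] == "Page":
                if index == 0:
                    return kid
                index -= 1
            elif index < entry["Count"]:
                node = entry  # target lives in this subtree
                break
            else:
                index -= entry["Count"]  # skip the subtree without loading it
        else:
            raise IndexError("page index out of range")


# Root (object 1) with 123 inline wrappers, each holding one page ref (2..124).
objects = {ref: {"Type": "Page"} for ref in range(2, 125)}
objects[1] = {
    "Type": "Pages",
    "Count": 123,
    "Kids": [{"Type": "Pages", "Count": 1, "Kids": [ref]} for ref in range(2, 125)],
}

fetched = []


def fetch(ref):
    fetched.append(ref)
    return objects[ref]


last_page_ref = find_page_with_counts(fetch, 1, 122)
```

In this model only the root and the target page are fetched, because each wrapper's Count is readable without another request.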

richard-smith-preservica · 2025-10-14T10:04:27Z

richard-smith-preservica
Oct 14, 2025
Author

No, we're not in control of the PDF (it's a customer upload) and it's a preservation copy so we want to maintain it as it comes to us and not rewrite it.

calixteman · 2025-10-14T18:44:05Z

calixteman
Oct 14, 2025
Maintainer

Are the PDFs linearized?
I'd tend to agree that if the page count from the Linearized dictionary and the Count from the Pages dictionary are equal, we could probably trust the number.
So would it be acceptable for you to skip the page number verification only when the conditions above are met?
If so, could you write a patch, and we'll see how it behaves in our test suite?
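The condition could be sketched like this, in Python rather than pdf.js code. Both dictionaries are modelled as plain dicts and `counts_agree` is a hypothetical helper. /N is the key under which the linearization parameter dictionary records the number of pages.

```python
def counts_agree(linearized_dict, root_pages_dict):
    """True when the file is linearized and /N matches the root /Count."""
    if linearized_dict is None:
        return False  # not linearized: keep the normal verification
    n = linearized_dict.get("N")
    count = root_pages_dict.get("Count")
    return isinstance(n, int) and n > 0 and n == count


agree = counts_agree({"Linearized": 1, "N": 123}, {"Type": "Pages", "Count": 123})
disagree = counts_agree({"Linearized": 1, "N": 123}, {"Type": "Pages", "Count": 120})
not_linearized = counts_agree(None, {"Type": "Pages", "Count": 123})
```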

richard-smith-preservica · 2025-10-15T11:55:19Z

richard-smith-preservica
Oct 15, 2025
Author

I'll have to check our examples again (it's been a year since I investigated this :) ) but iirc not all of them have a /Linearized dictionary. For our purposes we'd probably still want something like "it is Linearized or has more than N pages".

Some of them are public (our customer is hosting them publicly through our portal), so I can ask whether I can share links.

calixteman · 2025-10-16T16:47:12Z

calixteman
Oct 16, 2025
Maintainer

I don't know that much about linearized PDFs. I'm reading the spec, and normally they're built for performance.
Normally there's a page offset hint table, which should make it possible to get any page very quickly without having to fetch the whole file.
As far as I can tell, we don't really use this extra data to improve things; we only use the linearization data to render the first page a bit faster.
Honestly, I'm not super excited by your proposal: in general we try to fix bugs, and we don't really try to add new ones on purpose to make someone happy. But I'd be happy to try to improve things for linearized PDFs, which are supposed to fit your use case (if I understood correctly).

calixteman · 2025-10-29T14:21:21Z

calixteman
Oct 29, 2025
Maintainer

I'm inclined to accept your PR (without the 20-page limit), but I just want to make sure we don't make a mistake.
Would it be possible to build a few PDFs with 20 Kids and a Count equal to 20 that nevertheless fail your assumption?
For example, a PDF could have 2 pages under the first Kid and 0 under another one, or could have any number of pages in a tree structure with a Count that is simply wrong.
I'd like to see how other viewers handle them.
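As a starting point for such test files, here is a toy Python model of the first case (2 pages under the first Kid, 0 under the last). It builds an in-memory tree, not a real PDF, and `build_violating_tree` and `true_page_order` are hypothetical names.

```python
def true_page_order(objects, ref):
    """Depth-first walk returning the refs of real /Page leaves in order."""
    node = objects[ref]
    if node["Type"] == "Page":
        return [ref]
    pages = []
    for kid in node["Kids"]:
        pages.extend(true_page_order(objects, kid))
    return pages


def build_violating_tree(n=20):
    """Root has n Kids and Count n, but the Kids are not 1:1 with pages."""
    objects = {
        2: {"Type": "Pages", "Count": 2, "Kids": [3, 4]},  # first kid: 2 pages
        3: {"Type": "Page"},
        4: {"Type": "Page"},
    }
    plain = list(range(5, 5 + n - 2))  # n - 2 ordinary /Page kids
    for ref in plain:
        objects[ref] = {"Type": "Page"}
    empty = 5 + n - 2
    objects[empty] = {"Type": "Pages", "Count": 0, "Kids": []}  # last kid: 0 pages
    objects[1] = {"Type": "Pages", "Count": n, "Kids": [2, *plain, empty]}
    return objects


objects = build_violating_tree()
root = objects[1]
truth = true_page_order(objects, 1)  # what a full walk would display
shortcut = root["Kids"]              # what "assume 1:1" would display
wrong = [i for i in range(root["Count"]) if shortcut[i] != truth[i]]
```

In this model every index is wrong under the shortcut. The first and last indices land on /Pages nodes rather than pages, and each index in between lands on the page after the one requested.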

richard-smith-preservica · 2025-10-29T15:00:21Z

richard-smith-preservica
Oct 29, 2025
Author

Thanks for reminding me that I meant to reply to this :)

I went and found our examples from when we were dealing with this issue, and one of our main use cases for this has PDFs which aren't /Linearized at all :(

It's possible that the page hint table would help with linearized PDFs. Honestly, I didn't know about it, and that data is hard to read, so I can't easily check it in the linearized file I have.

I'll try to construct some intentionally violating examples. I would expect outcomes like "the wrong pages are shown" (i.e. you're looking at page 5 but page 6 is actually rendered).
