When can we trust a /Pages dictionary to contain Page references directly? #18637
Replies: 7 comments
Are you able to modify the PDF structure? That could help to get the last page quickly without having to fetch all the other pages.
No, we're not in control of the PDF (it's a customer upload), and it's a preservation copy, so we want to maintain it as it comes to us and not rewrite it.
Are the PDFs linearized?
I'll have to check our examples again (it's been a year since I investigated this :) ) but iirc not all of them are marked with a /Linearized dictionary. For our purposes we'd probably still want something like "it is Linearized or has more than N pages". Some of them are public (our customer is hosting them publicly through our portal), so I can ask if I can drop links.
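To make the "it is Linearized or has more than N pages" idea concrete, here is a rough sketch of how such a check could look. It assumes the first bytes of the file have already been fetched, and it relies on the linearization parameter dictionary being the first object in the file and sitting within the first 1024 bytes. The function names and the `minPages` parameter are hypothetical and are not part of pdf.js. Note that the marker alone does not prove the file is still validly linearized, since a later incremental update leaves the key in place.

```typescript
// Hypothetical helper, not pdf.js API: look for a /Linearized key in the
// first 1024 bytes, where the linearization dictionary is expected to live.
function hasLinearizedMarker(head: Uint8Array): boolean {
  const text = Array.from(head.subarray(0, 1024), (b) =>
    String.fromCharCode(b)
  ).join("");
  // Matches e.g. "/Linearized 1" or "/Linearized 1.0".
  return /\/Linearized\s+[0-9]/.test(text);
}

// Hypothetical criterion from the discussion: either the file carries the
// marker, or it has more than `minPages` pages (the "N" above).
function qualifiesForDirectFetch(
  head: Uint8Array,
  pageCount: number,
  minPages: number
): boolean {
  return hasLinearizedMarker(head) || pageCount > minPages;
}
```

The threshold is left as a parameter because the thread does not settle on a value for N.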
I don't know that much about linearized PDFs; I'm reading the spec, and normally they're built for performance.
I'm inclined to accept your PR (without the 20 limitation), but I just want to make sure we don't make a mistake.
Thanks for reminding me that I meant to reply to this :) I went and found our examples from when we were dealing with this issue, and one of our main use cases for this has PDFs which aren't /Linearized at all :( It's possible that the page hint table would help with linearized PDFs. Honestly, I didn't know about that, and it's hard to read that data, so I can't easily check in the linearized file I have. I'll try to construct some intentionally violating examples. I would expect outcomes of the "wrong pages are shown" type (i.e. you're looking at page 5 but page 6 is actually rendered).
Related: my patch here #18627 and discussion on this issue #14570
Use case: Render-on-demand (i.e. disableStream and disableAutoFetch set, and the server supports ranged requests) on a large document like a PDF where each page is an image, and we want to show something to the user quickly. Such documents are common outputs of digitisation.
When the PDF has (i) all the /Page dicts referenced from the top-level /Pages dict and (ii) /Page dicts interleaved with content, loading the last page causes an XHR to be sent for every prior page (see my comment on #14570 for more detail).
If we could trust that the /Kids of the top-level /Pages dict contained exactly the refs to /Page dicts and nothing else, we could avoid loading all the previous pages, and fetching a single page would be quick. The team rightly don't want to be too trusting of that, since lots of PDFs get created wrongly (see all the justifications on tickets mentioned here #14570 (comment)). The question in this discussion is: are there circumstances where we can trust that?
This change master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely contains the core of the idea - if the page count and number of children of the top level /Pages align then assume it's 1:1 and fetch pages independently.
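As a rough illustration of that idea (not the code from the linked branch), the check could look something like the sketch below. The `PdfObject` model, the function names and the `fetchObject` callback are all hypothetical stand-ins and are not pdf.js internals. The sketch also re-checks the fetched object's type, because matching counts alone cannot rule out a malformed tree in which a kid is an intermediate /Pages node.

```typescript
// Hypothetical in-memory model of PDF objects; not pdf.js's internal types.
type Ref = { num: number; gen: number };
type PdfObject =
  | { type: "Pages"; count: number; kids: Ref[] }
  | { type: "Page" };
type PagesNode = Extract<PdfObject, { type: "Pages" }>;

// Heuristic from the discussion: /Count equals the number of /Kids.
function looksFlat(root: PagesNode): boolean {
  return root.count > 0 && root.kids.length === root.count;
}

// Returns the ref at the page's index if the tree looks flat, else null.
function directPageRef(root: PagesNode, pageIndex: number): Ref | null {
  if (!looksFlat(root)) return null;
  if (!Number.isInteger(pageIndex) || pageIndex < 0 || pageIndex >= root.count) {
    return null;
  }
  return root.kids[pageIndex] ?? null;
}

// Fetches only the candidate object and confirms it is a /Page.
// A null result means "fall back to the normal page tree traversal".
function fetchPageDirectly(
  root: PagesNode,
  pageIndex: number,
  fetchObject: (ref: Ref) => PdfObject
): Ref | null {
  const ref = directPageRef(root, pageIndex);
  if (ref === null) return null;
  return fetchObject(ref).type === "Page" ? ref : null;
}
```

With this shape, the last page costs one object fetch when the assumption holds. When the fetched kid turns out not to be a /Page, the caller falls back to the existing traversal, so a wrong guess costs one extra request instead of rendering the wrong page.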
Possible criteria for 'this is ok'
... but really I'm opening this to see if anyone has more robust suggestions for ways we can avoid fetching every /Page dictionary in this case.