Description
Attach (recommended) or Link to PDF file
Note: I saw that #11207 was closed due to lack of a publicly available test case. Hopefully this file can help!
Web browser and its version
nodejs 20.18.1
Operating system and its version
Windows 11
PDF.js version
pdfjs-dist 4.10.38
Is the bug present in the latest PDF.js version?
Yes
Is a browser extension
No
Steps to reproduce the problem
const pdfDocument = await pdfjslib.getDocument(...).promise; // getDocument() returns a loading task; await its .promise
const page = await pdfDocument.getPage(16); // or other pages (see "What went wrong?")
const textContent = await page.getTextContent({ includeMarkedContent: false }); // throws here
What is the expected behavior?
While this PDF opens in Chrome, not all pages render. It is definitely corrupted. That said, there seemed to be some interest in handling at least the flate stream error more gracefully so I figured it was worth filing.
For my use case, I'd love it if pdfjs would not choke in these cases and would instead yield a page with whatever detail about the page was available (e.g. falling back to blank), ideally with a flag on the page object letting me know whether errors occurred.
I understand that this might not be the goal of the library (at least not for all of these issues).
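To make the desired behavior concrete, here is a rough sketch of the kind of per-page fallback I mean, written as a caller-side wrapper. `extractPagesSafely` and the shape of its result objects (`pageNumber`, `items`, `error`) are hypothetical names of my own, not part of the PDF.js API; the only PDF.js calls assumed are `numPages`, `getPage()` and `getTextContent()`.

```javascript
// Hypothetical helper (not part of PDF.js): extracts text page by page and
// records a failure on the result instead of letting one bad page abort the run.
async function extractPagesSafely(pdfDocument) {
  const results = [];
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    // Start from a "blank page" result; `error` stays null when nothing failed.
    const result = { pageNumber, items: [], error: null };
    try {
      const page = await pdfDocument.getPage(pageNumber);
      const textContent = await page.getTextContent({ includeMarkedContent: false });
      result.items = textContent.items;
    } catch (err) {
      // Covers failures from both getPage() and getTextContent().
      result.error = err;
    }
    results.push(result);
  }
  return results;
}
```

A wrapper like this only gives all-or-nothing results per page, which is why handling it inside the library would be nicer: it could presumably keep whatever content was decoded before the error.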
What went wrong?
Processing this file in PDFJS I see a number of errors:
- getTextContent() on pages 16, 87, 101 fails with
UnknownErrorException: Bad encoding in flate stream
- getTextContent() on page 32 fails with
UnknownErrorException: Bad (uncompressed) XRef entry: 101R
- getPage() on pages 33-40, 91-100 fails with
UnknownErrorException: Illegal character: 41
Link to a viewer
No response
Additional context
No response