[Bug]: getTextContent() fails with Bad encoding in flate stream (with test case) #19609

@mikethea1

Attach (recommended) or Link to PDF file

Corrupted PDF.pdf

Note: I saw that #11207 was closed due to lack of a publicly available test case. Hopefully this file can help!

Web browser and its version

nodejs 20.18.1

Operating system and its version

Windows 11

PDF.js version

pdfjs-dist 4.10.38

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

const pdfDocument = await pdfjslib.getDocument(...).promise; // getDocument() returns a loading task; await its promise
const page = await pdfDocument.getPage(16); // or other pages (see "What went wrong?")
const textContent = await page.getTextContent({ includeMarkedContent: false }); // throws here

What is the expected behavior?

While this PDF opens in Chrome, not all pages render. It is definitely corrupted. That said, there seemed to be some interest in handling at least the flate stream error more gracefully, so I figured it was worth filing.

For my use case, I'd love it if PDF.js did not choke in these cases and instead yielded a page with whatever detail about the page was available (e.g. falling back to blank), ideally with a flag on the page object letting me know whether errors occurred.
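
In the meantime, the fallback described above can be approximated in calling code. This is only a minimal sketch, assuming nothing beyond the `getPage()` and `getTextContent()` calls from the reproduction steps; `safeGetTextContent`, the `error` field, and the empty `{ items, styles }` stand-in are hypothetical names and shapes chosen for illustration, not part of PDF.js:

```javascript
// Hypothetical helper, not part of PDF.js: returns a blank text content
// object plus an error flag instead of throwing on a damaged page.
async function safeGetTextContent(pdfDocument, pageNumber) {
  try {
    const page = await pdfDocument.getPage(pageNumber);
    const textContent = await page.getTextContent({ includeMarkedContent: false });
    return { pageNumber, textContent, error: null };
  } catch (err) {
    // Fall back to an empty page, but keep the error so callers can detect it.
    return { pageNumber, textContent: { items: [], styles: {} }, error: err };
  }
}
```

Callers can then check `result.error` to tell a genuinely blank page from one that failed to parse.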

I understand that this might not be the goal of the library (at least not for all of these issues).

What went wrong?

Processing this file with PDF.js, I see several errors:

  • getTextContent() on pages 16, 87, and 101 fails with UnknownErrorException: Bad encoding in flate stream
  • getTextContent() on page 32 fails with UnknownErrorException: Bad (uncompressed) XRef entry: 101R
  • getPage() on pages 33-40 and 91-100 fails with UnknownErrorException: Illegal character: 41
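
A list like the one above can be gathered by walking every page and noting which call threw. The following is a rough sketch rather than the exact script used for this report; `surveyPages` is a hypothetical helper, and it assumes the document object exposes `numPages` alongside `getPage()`:

```javascript
// Hypothetical survey helper: visits every page and records which call failed.
async function surveyPages(pdfDocument) {
  const failures = [];
  for (let n = 1; n <= pdfDocument.numPages; n++) {
    let stage = "getPage";
    try {
      const page = await pdfDocument.getPage(n);
      stage = "getTextContent";
      await page.getTextContent({ includeMarkedContent: false });
    } catch (err) {
      // Record the page, the failing call, and the error text, then continue.
      failures.push({ page: n, stage, message: String(err?.message ?? err) });
    }
  }
  return failures;
}
```

Each entry names the page, the failing call (`getPage` or `getTextContent`), and the error message, so one bad page does not stop the scan.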

Link to a viewer

No response

Additional context

No response
