When extracting pages from PDF, ImageMagick improves images but blurs text; pdftoppm retains crisp text but also dark images

Question

I want to extract all pages from this PDF file, improve their color levels, and eventually OCR them.

I've used ImageMagick:

magick Historia_de_CA_vol1_Cap1_0.pdf mogrify -auto-level Historia_de_CA_vol1_Cap1_0-*.jpg,

which remarkably improves the quality of embedded images, as can be seen in the document's 1st and 21st pages. I suspect this is because ImageMagick properly interprets a transparency layer that is converted to a black or dark background by Adobe Acrobat Reader. Unfortunately, the extracted text is blurrier than in the original.

I've also used Poppler's pdftoppm utility:

pdftoppm -jpeg Historia_de_CA_vol1_Cap1_0.pdf Historia_de_CA_vol1_Cap1_0,

which produces crisp text, suitable for OCR, but retains the poor quality of the embedded images seen on pages 1 and 21 of the original PDF, where transparency seems to be rendered as a dark layer.

How can I get ImageMagick to produce improved images and crisp text suitable for OCR, or conversely, how can I get pdftoppm to properly render the suspected transparent layer in the original PDF?

Questions about interactive use of the ImageMagick command-line tool should be asked on Super User.
– Christoph Rackwitz, commented Dec 17, 2024 at 17:38

fmw42 · Accepted Answer · 2024-12-17 21:32:23Z


Your ImageMagick command may be flawed. In magick mogrify, the two words must stay together; do not put an image filename between them. The structure of magick mogrify is

magick mogrify -path path_to_output -format format_for_output * (or *.suffix)

This reads all images in the current directory and writes them with the same name to the desired directory and with the desired suffix.

Perhaps you want just magick, not magick mogrify:

magick Historia_de_CA_vol1_Cap1_0.pdf -auto-level Historia_de_CA_vol1_Cap1_0.jpg

That will create outputs named Historia_de_CA_vol1_Cap1_0-N.jpg, where N runs from 0 to one less than the number of pages.

ADDITION

To increase text sharpness, raise the density at which the PDF is rendered and then resize by the inverse factor.

magick -density 288 Historia_de_CA_vol1_Cap1_0.pdf -resize 25% -auto-level Historia_de_CA_vol1_Cap1_0.jpg

(Note: density of 288=72x4, so resize by 1/4=25%)
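The same arithmetic works for other factors. As a rough sketch, the helper below computes the density and the matching resize percentage for a chosen factor and prints the resulting command without running it. The function name build_magick_cmd is hypothetical (it is not part of ImageMagick), the 72 dpi base is the default density mentioned in this answer, and the optional third argument lets you swap -auto-level for another operator such as -normalize.

```shell
#!/bin/sh
# Sketch only: build_magick_cmd is a hypothetical helper, not an ImageMagick tool.
# It prints the command instead of executing it, so it can be reviewed first.

BASE_DENSITY=72   # default density in dpi, as stated in the answer

build_magick_cmd() {
    pdf=$1                       # input PDF
    factor=$2                    # supersampling factor, e.g. 4
    level_op=${3:--auto-level}   # contrast operator; defaults to -auto-level

    # Render at factor times the base density: 72 * 4 = 288.
    density=$((BASE_DENSITY * factor))

    # Resize by the inverse of the factor: 100 / 4 = 25 (percent).
    percent=$(awk -v f="$factor" 'BEGIN { printf "%g", 100 / f }')

    # Output name: same base name with a .jpg suffix.
    out="${pdf%.pdf}.jpg"

    printf 'magick -density %s %s -resize %s%% %s %s\n' \
        "$density" "$pdf" "$percent" "$level_op" "$out"
}

# Print the command for a 4x factor, then run it by hand once it looks right.
build_magick_cmd Historia_de_CA_vol1_Cap1_0.pdf 4
```

With a factor of 4 this prints the command shown above. Higher factors render more pixels per page, so expect processing time and memory use to grow with the factor.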

edited Dec 17, 2024 at 21:32

answered Dec 17, 2024 at 16:43

fmw42


4 Comments

oavaldezi Dec 17, 2024 at 17:34

Thank you for your suggestions. With your commands, ImageMagick again improves the quality of embedded images, but the extracted text is blurrier than in the original. I suspect that a transparency layer makes the text crisp but darkens the images in the original PDF and in pdftoppm's output. ImageMagick processes the transparency correctly, improving the embedded images but losing sharpness in the text.

fmw42 Dec 17, 2024 at 18:18

Put -density 288 before reading the PDF (right after magick), then put -resize 25% right after reading the PDF if you want to preserve the dimensions. If you leave off the -resize, the output will be 4x larger in each dimension. (Note: 288 = 4x72, where 72 dpi is the default density.)

fmw42 Dec 17, 2024 at 19:05

@oavaldezi See my ADDITION in my answer for the command as described above in my comment

oavaldezi Dec 17, 2024 at 19:06

Thanks! That helped a great deal with the sharpness of the text. I also replaced '-auto-level' with '-normalize', and the contrast improved significantly.
