I want to extract all pages from this PDF file, improve their color levels, and eventually OCR them.

I've used ImageMagick:

magick Historia_de_CA_vol1_Cap1_0.pdf mogrify -auto-level Historia_de_CA_vol1_Cap1_0-*.jpg,

which remarkably improves the quality of the embedded images, as can be seen on the document's 1st and 21st pages. I suspect this is because ImageMagick properly interprets a transparency layer that Adobe Acrobat Reader converts to a black or dark background. Unfortunately, the extracted text is blurrier than in the original.

I've also used Poppler's pdftoppm utility:

pdftoppm -jpeg Historia_de_CA_vol1_Cap1_0.pdf Historia_de_CA_vol1_Cap1_0,

which produces crisp text, suitable for OCR, but retains the poor quality of the embedded images seen on pages 1 and 21 of the original PDF, where transparency seems to be rendered as a dark layer.

How can I get ImageMagick to produce both improved images and crisp text suitable for OCR, or conversely, how can I get pdftoppm to properly render the suspected transparency layer in the original PDF?

  • Questions about interactive use of the imagemagick command line tool should be asked on Super User. Commented Dec 17, 2024 at 17:38

1 Answer

Your ImageMagick command may be flawed. With magick mogrify, do not put an image name between magick and mogrify. The structure of magick mogrify is

magick mogrify -path path_to_output -format format_for_output * (or *.suffix)

This reads all images in the current directory and writes them with the same name to the desired directory and with the desired suffix.

Perhaps you want just magick, not magick mogrify:

magick Historia_de_CA_vol1_Cap1_0.pdf -auto-level Historia_de_CA_vol1_Cap1_0.jpg

That will create one output per page, named Historia_de_CA_vol1_Cap1_0-N.jpg, where N runs from 0 to the number of pages minus 1.
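Since the numbering starts at 0, a given page lands in the file whose index is one less than the page number. A minimal sketch of that mapping, where the variable names `page`, `index`, and `outfile` are only illustrative:

```shell
# Map a 1-based PDF page number to the zero-based output file name.
page=21                          # the 21st page mentioned in the question
index=$(( page - 1 ))            # ImageMagick numbers multi-page output from 0
outfile="Historia_de_CA_vol1_Cap1_0-${index}.jpg"
echo "$outfile"
```

So, to inspect the 21st page after the conversion, look at the file ending in -20.jpg.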

ADDITION

To increase text sharpness, render the PDF at a higher density and then resize by the inverse factor:

magick -density 288 Historia_de_CA_vol1_Cap1_0.pdf -resize 25% -auto-level Historia_de_CA_vol1_Cap1_0.jpg

(Note: a density of 288 = 72 × 4, so resize by 1/4 = 25%.)
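The same arithmetic can be sketched for any supersampling factor. This assumes the 72 dpi default density mentioned in this thread; the variable names (`base_density`, `factor`, `density`, `resize`, `cmd`) are hypothetical, and the script only prints the command rather than running it:

```shell
# Derive matching -density and -resize values from one supersampling factor.
base_density=72                          # default density, per the answer
factor=4                                 # hypothetical choice; 4 gives 288 dpi
density=$(( base_density * factor ))     # rasterize at factor x the default
resize=$(( 100 / factor ))               # integer percent; exact only if factor divides 100
cmd="magick -density ${density} Historia_de_CA_vol1_Cap1_0.pdf -resize ${resize}% -auto-level Historia_de_CA_vol1_Cap1_0.jpg"
echo "$cmd"
```

Dropping the -resize step keeps the larger rendering, which may or may not help OCR, at the cost of files that are `factor` times larger in each dimension.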

4 Comments

Thank you for your suggestions. With your commands, ImageMagick again improves the quality of the embedded images, but the extracted text is blurrier than in the original. I suspect that a transparency layer makes the text crisp but darkens the images in the original PDF and in pdftoppm's output. ImageMagick processes the transparency correctly, improving the embedded images but losing sharpness in the text.
Put -density 288 before reading the PDF (right after magick), then put -resize 25% right after reading the PDF if you want to preserve the dimensions. If you leave off the -resize, the output will be 4x larger in each dimension. (Note: 288 = 4 × 72, where 72 dpi is the default density.)
@oavaldezi See the ADDITION in my answer for the command described in my comment above.
Thanks! That helped a great deal with the sharpness of the text. I also replaced '-auto-level' with '-normalize', and the contrast improved significantly.
