I am trying to OCR a PDF file with Tesseract, but it fails with:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:upload526.pdf
IMAGE::read_header:Error:Can't read this image type:upload526.pdf
tesseract:Error:Read of file failed:upload526.pdf
Segmentation fault

I need this to build a searchable database of PDFs that were scanned manually (as images). What am I doing wrong? I read that Tesseract supports PDFs. I don't know which version I have, because neither tesseract --version nor tesseract -v works.

2 Answers

Tesseract does not read PDF files as input. You'll need to convert the PDF to an image format (TIFF, PNG) first, for example with Ghostscript, ImageMagick, or your own code.

You could try something along these lines, using ImageMagick's convert to rasterize the PDF before running Tesseract:

convert -density 300 file.pdf -depth 8 file.tiff  
tesseract file.tiff output
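
If you have many files to process, the two commands can be wrapped in a small shell function. This is only a sketch: the function name `ocr_pdf` and the `DRY_RUN` switch are my own inventions for illustration, and it assumes that ImageMagick's `convert` and `tesseract` are both on your PATH. Very old Tesseract builds may not handle the multi-page TIFF that a multi-page PDF produces, so check the result on a sample file first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract or ImageMagick):
# rasterize a PDF to TIFF, then OCR the TIFF.
# Usage: ocr_pdf input.pdf [output_base]
# Set DRY_RUN=1 to print the commands instead of running them.
ocr_pdf() {
    pdf=$1
    # Default output base: the input name without its .pdf suffix
    base=${2:-${pdf%.pdf}}

    run=
    [ "${DRY_RUN:-0}" = 1 ] && run=echo

    # Same settings as above: 300 dpi rasterization, 8-bit depth
    $run convert -density 300 "$pdf" -depth 8 "$base.tiff" || return 1

    # tesseract appends .txt to the output base itself
    $run tesseract "$base.tiff" "$base"
}
```

For example, `for f in *.pdf; do ocr_pdf "$f"; done` would leave a `.txt` file next to each PDF, and `DRY_RUN=1 ocr_pdf upload526.pdf` shows what would be executed without touching anything.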
