tesseract ocr pdf - segmentation fault

Question

I am trying to OCR a PDF file with Tesseract, but it fails with:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:upload526.pdf
IMAGE::read_header:Error:Can't read this image type:upload526.pdf
tesseract:Error:Read of file failed:upload526.pdf
Segmentation fault

I need this to build a searchable database of PDFs that were scanned manually (as images). What am I doing wrong? I read that Tesseract supports PDFs. I don't know which version I have, because neither tesseract --version nor tesseract -v works.

nguyenq · Accepted Answer · 2014-12-13 00:03:53Z

1

Tesseract does not read PDF files. You'll need to convert the PDF to an image format (TIFF, PNG) first, for example with Ghostscript or ImageMagick, or programmatically.
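
As a rough sketch of the Ghostscript route, the function below renders each page of a PDF to a PNG and then runs Tesseract on every page image. The function name pdf_to_text, the page-NNN file naming, and the 300 dpi resolution are assumptions made for illustration; none of them come from the question. It expects gs and tesseract to be on the PATH.

```shell
#!/bin/sh
# Sketch: rasterize a PDF with Ghostscript, then OCR each page with Tesseract.
# pdf_to_text is a hypothetical helper name; 300 dpi is an assumed resolution.
pdf_to_text() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # -o sets the output file and implies -dBATCH -dNOPAUSE;
    # %03d is replaced by the page number (page-001.png, page-002.png, ...).
    gs -q -sDEVICE=png16m -r300 -o "$outdir/page-%03d.png" "$pdf" || return 1
    for img in "$outdir"/page-*.png; do
        [ -e "$img" ] || continue   # skip if no pages were produced
        # Tesseract appends .txt to the output base name itself.
        tesseract "$img" "${img%.png}" || return 1
    done
}
```

Calling pdf_to_text upload526.pdf ocr_out should leave one .txt file per page in ocr_out, which can then be loaded into whatever search index you use.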


Reuben L. · Accepted Answer · 2014-12-15 06:57:59Z

1

You could try something along these lines, using ImageMagick's convert command:

convert -density 300 file.pdf -depth 8 file.tiff
tesseract file.tiff output
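
Since the goal is to process many scanned PDFs, the same two commands can be wrapped in a loop. This is only a sketch: ocr_pdf_dir is a hypothetical helper name, and it assumes that convert and tesseract are on the PATH and that ImageMagick can read PDFs on your system (it normally delegates PDF rendering to Ghostscript).

```shell
#!/bin/sh
# Sketch: apply the convert + tesseract steps to every PDF in a directory.
# ocr_pdf_dir is a hypothetical helper name, not part of either tool.
ocr_pdf_dir() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip when no PDFs match
        base=${pdf%.pdf}
        # Same options as the commands above: 300 dpi, 8 bits per channel.
        convert -density 300 "$pdf" -depth 8 "$base.tiff" || return 1
        # Writes the recognized text to "$base.txt".
        tesseract "$base.tiff" "$base" || return 1
    done
}
```

For example, ocr_pdf_dir ./scans would leave a .tiff and a .txt file next to each PDF in ./scans.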

