PLEASE NOTE: I understand there are many posts about Tesseract. I have not yet found a working solution that does not produce errors.

I am trying to run OCR on an image with Tesseract. I have tried numerous solutions from various forums without success. I converted a PDF to an image and saved it, then loaded that image with cv2. I have been able to display the image as well. Now I am trying to apply pytesseract's image_to_string() function to it.

I have tried adjusting pytesseract.pytesseract.tesseract_cmd and made sure that both the wrapper (pytesseract) and the actual Tesseract package are installed. Here is the code:

from wand.image import Image
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:/Users/Afton/anaconda3/Scripts/pytesseract.exe'


# Convert from pdf and save as image
pdf = 'C:/path/example.pdf'
outputFilename = 'C:/path/example.jpg'

with Image(filename=pdf) as img:
    img.save(filename=outputFilename)

# Read image
imagePath = outputFilename
image = cv2.imread(imagePath)    

# Configure OCR with pytesseract
config = r'-l deu --oem 1 --psm 3'
text = pytesseract.image_to_string(image, config=config)

# Print text output
text = text.split('\n')
print(text)

This is the current error:

pytesseract.pytesseract.TesseractError: (2, 'Usage: pytesseract [-l lang] input_file')

Before, the error was related to the pytesseract.pytesseract.tesseract_cmd input.

Any help is appreciated.

Update: the text in the image is German. I have tried to specify this in the configuration.

Update 2: I tried an alternative path from this resource (with my file location):

pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files/Tesseract-OCR/tesseract.exe' 

I now get this error:

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\\Program Files\\Tesseract-OCR/tessdata/deu.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'deu\' Tesseract couldn\'t load any languages! Could not initialize tesseract.') 

Note for others with this problem: I downloaded the language data from https://github.com/tesseract-ocr/tessdata because I am reading a German document. All language files are available there. The issue was missing language data: the error above shows that Tesseract could not open deu.traineddata in its tessdata directory.
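For anyone who wants to check this before calling image_to_string(), here is a minimal sketch using only the standard library. The function name missing_traineddata is hypothetical (it is not part of pytesseract), and the tessdata location has to be supplied by you; on my machine the error message suggests it is C:/Program Files/Tesseract-OCR/tessdata.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes that have no <lang>.traineddata file in tessdata_dir.

    Hypothetical helper for illustration; it only looks at file names on disk.
    """
    tessdata = Path(tessdata_dir)
    # Tesseract accepts several languages joined with "+", e.g. "deu+eng".
    codes = [code for code in langs.split("+") if code]
    return [code for code in codes if not (tessdata / f"{code}.traineddata").is_file()]
```

If missing_traineddata(r'C:/Program Files/Tesseract-OCR/tessdata', 'deu') returns ['deu'], the German file is not where Tesseract looks for it, which would match the "Error opening data file" message above.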

  • What version of Tesseract did you install? Commented Mar 10, 2021 at 18:12
  • @user898678 I used the 64-bit version from here: github.com/UB-Mannheim/tesseract/wiki Commented Mar 11, 2021 at 7:57

1 Answer

This line is wrong:

pytesseract.pytesseract.tesseract_cmd = r'C:/Users/Afton/anaconda3/Scripts/pytesseract.exe'

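tesseract_cmd must point to the Tesseract executable itself (tesseract.exe), not to the pytesseract wrapper script. The "Usage: pytesseract [-l lang] input_file" error is the wrapper's own usage message. Please read the pytesseract documentation.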
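As a rough illustration of picking the right executable, the sketch below looks for the Tesseract engine on PATH, then in a list of candidate paths, and skips anything named pytesseract. The helper resolve_tesseract_cmd and its use_path parameter are hypothetical names made up for this example, and the candidate paths are assumptions you would replace with your own install location.

```python
import shutil
from pathlib import Path


def resolve_tesseract_cmd(candidates=(), use_path=True):
    """Return a path to the Tesseract engine executable, or None if none is found.

    Hypothetical helper for illustration; it is not part of pytesseract.
    """
    paths = []
    if use_path:
        # shutil.which returns None when "tesseract" is not on PATH.
        found = shutil.which("tesseract")
        if found:
            paths.append(found)
    paths.extend(str(c) for c in candidates)

    for path in paths:
        p = Path(path)
        # The pytesseract wrapper script is not the OCR engine, so skip it.
        if p.stem.lower() == "pytesseract":
            continue
        if p.is_file():
            return str(p)
    return None
```

The result can then be assigned with pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd([r'C:/Program Files/Tesseract-OCR/tesseract.exe']). A return value of None means none of the locations held a usable executable, so it is worth checking before assigning.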
