
I am working with the Python pytesseract package, using sample code like the following:

import pytesseract
from PIL import Image

tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)

And I received the following error message:

pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')

From my understanding, the error occurs when reading the file chi-sim.traineddata (the Simplified Chinese language data). Below are the attempts I have made to solve this problem.

  • My development environment is macOS on an M1 Mac, and I installed tesseract and tesseract-lang with Homebrew. I am fairly sure that the path specified above is exactly where the language data files are located, because when I call
print(pytesseract.get_languages(config = ""))

I get a long list of languages printed, including chi-sim.

  • Further, if I just use English instead of Chinese, the following code successfully recognizes the English text in an image:
text = pytesseract.image_to_string(image)
  • I've tried to specify the tessdata location (the TESSDATA_PREFIX environment variable) in several ways, including:
  1. Using the config parameter, as in the original code.

  2. Adding a global environment variable in PyCharm.

  3. Adding the following line to the code:

os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
  4. Adding the following line to bash_profile in the terminal:
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/

But unfortunately, none of these worked.

  • Suspecting that my chi-sim.traineddata file was somehow corrupted, I downloaded the trained data file directly from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata) using the "Download" button on the right, and placed the downloaded file in the tesseract-lang directory and in the original tesseract directory (where eng.traineddata is located). I tried both locations, and neither worked.

Are there any potential solutions to this issue?

  • If you are on Windows, did you set up the PATH environment variable for tesseract? Commented Jul 17, 2021 at 13:07
  • If get_languages(config = "") shows chi-sim, then why do you set tessdata-dir? Did you try without changing tessdata-dir? Commented Jul 17, 2021 at 13:11
  • Also, what is the language setting in your macOS? There used to be an issue with non-English system languages for tesseract. Commented Jul 17, 2021 at 13:14
  • In the question (not in a comment) you could add a link to the GitHub page where you found chi-sim.traineddata and describe how you downloaded it. Maybe you downloaded it the wrong way (i.e. in text mode instead of bytes mode), or maybe you got files for an older version: the GitHub repository with tessdata for 4.x has a link to tessdata for 3.x. Commented Jul 17, 2021 at 13:22
  • @seraph Hmm, this is a good point, because the general language setting of my device is chi-sim. I will check it out later and update this post if anything comes of it. Commented Jul 17, 2021 at 13:39

1 Answer

The code works for me on Linux if I use lang="chi_sim" (with _ instead of -), because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.


If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
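
To avoid guessing which spelling is on disk, you can check the tessdata directory before calling Tesseract. This is only a minimal sketch: the helper names available_languages, resolve_language and ocr are my own (they are not part of pytesseract), and it assumes the language code passed as lang must match the file name of the .traineddata file, which is what the behaviour described above suggests.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def resolve_language(requested, tessdata_dir):
    """Map a requested code such as "chi-sim" onto a spelling that exists on disk.

    Tries the code as given, then with "-" replaced by "_", then the reverse.
    Raises FileNotFoundError if none of these spellings has a traineddata file.
    """
    installed = set(available_languages(tessdata_dir))
    candidates = (
        requested,
        requested.replace("-", "_"),
        requested.replace("_", "-"),
    )
    for code in candidates:
        if code in installed:
            return code
    raise FileNotFoundError(
        f"No traineddata file for {requested!r} in {tessdata_dir}"
    )


def ocr(image_path, requested_lang, tessdata_dir):
    """Hypothetical wrapper: run OCR with whichever spelling exists on disk."""
    # Imported lazily so the helpers above work without pytesseract installed.
    import pytesseract
    from PIL import Image

    lang = resolve_language(requested_lang, tessdata_dir)
    config = f'--tessdata-dir "{tessdata_dir}"'
    return pytesseract.image_to_string(
        Image.open(image_path), lang=lang, config=config
    )
```

With the paths from the question, ocr("dataset/test.jpeg", "chi-sim", "/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/") would pass lang="chi_sim" to Tesseract if only chi_sim.traineddata is present in that directory.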
