
I have installed language support for chi_sim:

$ ls /usr/share/tesseract-ocr/5/tessdata
chi_sim.traineddata  eng.traineddata  pdf.ttf
configs              osd.traineddata  tessconfigs

You can try it by downloading photo.jpeg and running the following code:

import cv2
import pytesseract
from PIL import Image

image_path = 'photo.jpeg'
image = cv2.imread(image_path)  # OpenCV loads images as BGR
image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))  # BGR -> RGB for PIL
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)

Why do I get nothing in the output with the above code?

>>> print(pytesseract.get_languages(config=''))
['chi_sim', 'eng', 'osd']
  • check how to write chi_sim the right way by doing print(pytesseract.get_languages(config='')) to get the language list (I cannot install pytesseract, sorry) Commented Jul 24 at 14:07
  • It is better for you to try the code yourself and post your result. Commented Jul 24 at 14:14
  • Referring to stackoverflow.com/questions/68420764/…, perhaps you need to pass config = "--tessdata-dir \"/usr/share/tesseract-ocr/5/tessdata\"" in the call to pytesseract.image_to_string(...), even though this seems strange, because pytesseract.get_languages(config='') already lists chi_sim (see the sketch after these comments). Commented Jul 24 at 15:02
  • Looking at the output above, it looks like the call to tesseract returned a blank line. This suggests the image is not clear enough to decode. You could try running it manually from the command line to see what happens. Commented Jul 24 at 15:36
  • tesseract prefers black text on a white background, and it may have problems when the text is too small or too big. It may also need 72 dpi or more. See more: Improving the quality of the output | tessdoc Commented Jul 24 at 16:12
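Putting the two suggestions from the comments together, here is a minimal sketch (assuming the tessdata path from the question) that passes the tessdata directory explicitly and prints the raw result so an empty string is obvious, plus the equivalent CLI call for a manual test:

import pytesseract

# Point Tesseract at the tessdata directory from the question explicitly.
config = r'--tessdata-dir "/usr/share/tesseract-ocr/5/tessdata"'
text = pytesseract.image_to_string('photo.jpeg', lang='chi_sim', config=config)
print(repr(text))  # repr() makes an empty or whitespace-only result visible

# Equivalent manual test from the shell:
#   tesseract photo.jpeg stdout -l chi_sim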

2 Answers


The image as it stands is simply too poor for Tesseract to pick out clear characters. It would need to be rectified, have its contrast improved, and have colour thresholding applied to remove the background noise.

This image shows how some of those steps might look. However, what is left is still below par for ordinary OCR:

[image: the photo after partial rectification and contrast clean-up]
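As a rough sketch of what "contrast improved" and "colour thresholding" could look like in OpenCV (the CLAHE settings and the value cut-off of 120 are guesses that would need tuning per image):

import cv2
import numpy as np

image = cv2.imread('photo.jpeg')

# Boost local contrast with CLAHE on the lightness channel.
lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

# Colour-threshold: keep only dark (low-value) pixels, i.e. the text.
hsv = cv2.cvtColor(enhanced, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([0, 0, 0]), np.array([180, 255, 120]))
cleaned = cv2.bitwise_not(mask)  # black text on a white background
cv2.imwrite('cleaned.png', cleaned)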

So why can some systems look at that image and generate good text, like this:

中华人民共和国
居民身份证
签发机关
有效期限
2007.05.14-2027.05 14

The answer is aggregation: these systems have seen many similar images, and across enough of them they can reconstruct an average that is above par.

[image: text reconstructed by aggregating many similar images]
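A toy sketch of that idea, assuming you had several aligned shots of the same card (the file names below are hypothetical): averaging the frames suppresses random noise before OCR:

import cv2
import numpy as np

# Hypothetical frames; in practice they would need to be aligned first.
paths = ['frame1.jpeg', 'frame2.jpeg', 'frame3.jpeg']
frames = [cv2.imread(p).astype(np.float32) for p in paths]
average = (sum(frames) / len(frames)).astype(np.uint8)
cv2.imwrite('averaged.png', average)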

Even if you clean the image as well as this, Tesseract will still not come close to an artificially improved interpretation:

[image: a manually cleaned version of the photo]


To get better results from Tesseract, we'll need to convert the image to a binary image. That requires first converting the three-channel (RGB/BGR) image to a single-channel grayscale image, then thresholding it so that pixels above a certain threshold become white (255) and those below become black (0). To see this approach applied to this problem, see below:

import cv2
import pytesseract

if __name__ == "__main__":
    image_path = 'photo.jpeg'
    image = cv2.imread(image_path)

    # Convert to single-channel grayscale, then binarize with adaptive
    # thresholding (blockSize=21 neighbourhood, constant C=8 subtracted).
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        21, 8
    )

    # Side-by-side visualization: original | grayscale | thresholded.
    pipeline = cv2.hconcat([
        image,
        cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR),
        cv2.cvtColor(thresh, cv2.COLOR_GRAY2BGR),
    ])
    cv2.imshow("Preprocessing Pipeline [Original | Grayscale | Thresholded]", pipeline)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

    text = pytesseract.image_to_string(thresh, lang='chi_sim')
    print("OCR Result:")
    print(text)

The visualization code can be omitted, but it is included here for clarity; the figure should look something like:

[image: side-by-side figure of the preprocessing pipeline]
The left panel is the original image, the center is the grayscale conversion, and the right is the binary output of thresholding. The output is:

OCR Result:

山 华 人 氓 典 和

居 民 身 份 证 ,

签 发 机 关

, ,

0 国 咤

有 效 期 限 “2007.05.14-2027.05.14

This isn't perfect, but it does extract some of the text in the image. The main tuning knobs in this standard pipeline are the parameters of the adaptive thresholding algorithm; a tutorial from the official OpenCV documentation detailing the parameters for this algorithm (as well as other thresholding algorithms) can be found here. Additionally, you can experiment with the thresholding parameters in the script to evaluate their impact on the final extraction, for example:
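A minimal sketch of such an experiment (the parameter grid below is arbitrary; block sizes must be odd):

import cv2
import pytesseract

gray = cv2.cvtColor(cv2.imread('photo.jpeg'), cv2.COLOR_BGR2GRAY)
for block_size in (11, 21, 31):   # neighbourhood size (must be odd)
    for c in (4, 8, 12):          # constant subtracted from the weighted mean
        thresh = cv2.adaptiveThreshold(
            gray, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            block_size, c
        )
        text = pytesseract.image_to_string(thresh, lang='chi_sim')
        print(f"blockSize={block_size}, C={c}: {len(text.strip())} chars recognized")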

There are other transformations that can be applied, such as blurring and morphological operations, which are mentioned in Tesseract's documentation on improving quality. For example, in the snippet above, we could apply a blur after converting to grayscale but before thresholding:

blur = cv2.GaussianBlur(gray, (3, 3), 0) # alternatively, a box filter can be used via cv2.blur/cv2.boxFilter
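Along the same lines, a morphological opening/closing step could be slotted in after thresholding to remove speckle noise; this is a sketch, and the 2x2 kernel size is a guess that would need tuning:

import cv2
import numpy as np

gray = cv2.cvtColor(cv2.imread('photo.jpeg'), cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 21, 8)

kernel = np.ones((2, 2), np.uint8)
# Opening removes small white specks; closing fills small gaps in strokes.
cleaned = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)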

For more sophisticated use cases (e.g., a large set of images where traditional thresholding algorithms can't properly delineate the text, or where Tesseract isn't working), deep learning models are available. For example, CLOVA AI has models for both text detection (finding text in an image) and text recognition (parsing the detected text into strings, similar to OCR).
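For instance, a minimal sketch using the easyocr package, which builds on CLOVA AI's CRAFT detector (this assumes pip install easyocr and that 'ch_sim' is the right language code for your install; model weights are downloaded on first run):

import easyocr

reader = easyocr.Reader(['ch_sim', 'en'])  # loads detection + recognition models
for bbox, text, confidence in reader.readtext('photo.jpeg'):
    print(f"{confidence:.2f}  {text}")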

