Python tesseract and opencv - image_to_boxes() getting the wrong positions for characters

Question

I'm trying to use tesseract and opencv in Python to extract every character from an image and save each character to an individual image file. My code has no problem recognizing the text properly and printing it out, but it's not recognizing the position and size of the individual characters properly. Here's the input image:

https://i.sstatic.net/fYYlu.png

Here's my code:

#=Imports======================================================================
import os
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\User\AppData\Local\Tesseract-OCR\tesseract.exe'

#=Main=Code====================================================================

#Read in image
img = cv2.imread("feldman.png")

#Processing to make the image suitable for OCR
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) #Convert image to greyscale
img = cv2.threshold(img, 190, 255, cv2.THRESH_BINARY)[1] #Apply threshold effect

#Perform OCR and print to command line
print("Output from image_to_string():")
print(pytesseract.image_to_string(img))

#Save each character as an image
print("")
print("First character of each line from the output of image_to_boxes():")
hImg, wImg = img.shape #Get the dimensions of the image
os.makedirs("charimages", exist_ok=True) #cv2.imwrite() returns False instead of raising if the folder does not exist
#image_to_boxes() returns a string with one line per box: "char left bottom right top page", e.g. "s 596 164 609 181 0"
#The coordinates are measured from the BOTTOM-left corner of the image; the last field is the page number (0 for a single image)
boxes = pytesseract.image_to_boxes(img)
ROI_number = 0 #ROI = "region of interest", it's basically just the index for which character we're on
for b in boxes.splitlines(): #For every line in the string created by image_to_boxes()...
    b = b.split(' ') #b[0] is the character, b[1]..b[4] are left, bottom, right, top, b[5] is the page number
    char, left, bottom, right, top = b[0], int(b[1]), int(b[2]), int(b[3]), int(b[4])
    print(char, end="") #Print out each character recognized by image_to_boxes()
    #NumPy/OpenCV rows are counted from the TOP, so flip the y coordinates before slicing
    roi = img[hImg - top:hImg - bottom, left:right]
    cv2.imwrite("charimages/" + str(ROI_number) + ".jpeg", roi) #Save an image file for the character
    ROI_number += 1

Here is the output to the command line (which is almost perfectly correct):

Output from image_to_string():
FPT ISBN 0-688-05913-4 >$22.95

IMPONDERABLES

The Solution to the
Mysteries of Everyday Life

David Feldman

Illustrated by Kas Schwan

Did you ever wonder why you never
see baby pigeons? Or why a thumbs-up
gesture means “OK”? At last the solu-
tions to some of life’s most baffling
questions are gathered here in one
volume. Written in an informative
and entertaining style and illustrated
with drawings that are clearly to the
point, Imponderables gets to the bottom
of everyday life’s mysteries, among
them:

* Why is a mile 5,280 feet?

* Which fruits are in Juicy Fruit*®
gum?

* Why does an X stand for a kiss?

* Why don’t cats like to swim?

* Why do other people hear our
voices differently than we do?

Dictionaries, encyclopedias, and
almanacs don’t have the answers—
Imponderables does! And in answering
such questions, it touches on an aston-
ishing variety of subjects, including

(continued on back flap)



First character of each line from the output of image_to_boxes():
FPTISBN0-688-05913-4>$22.95IMPONDERABLESTheSolutiontotheMysteriesofEverydayLifeDavidFeldmanIllustratedbyKasSchwan~Didyoueverwonderwhyyouneverseebabypigeons?Orwhyathumbs-upgesturemeans“OK”?Atlastthesolu-tionstosomeoflife’smostbafflingquestionsaregatheredhereinonevolume.Writteninaninformativeandentertainingstyleandillustratedwithdrawingsthatareclearlytothepoint,Imponderablesgetstothebottomofeverydaylife’smysteries,amongthem:*Whyisamile5,280feet?*WhichfruitsareinJuicyFruit*®gum?*WhydoesanXstandforakiss?*Whydon’tcatsliketoswim?*Whydootherpeoplehearourvoicesdifferentlythanwedo?Dictionaries,encyclopedias,andalmanacsdon’thavetheanswers—Imponderablesdoes!Andinansweringsuchquestions,ittouchesonanaston-ishingvarietyofsubjects,including(continuedonbackflap)~

But when it comes to the output image files, a lot of them are wrong. Some of the images are correct, but a lot of them are just... messed up. Take the image files corresponding to the word "IMPONDERABLES" as an example. There are 13 files, 1 for each character, which makes perfect sense. However, some of the images contain multiple characters:

https://i.sstatic.net/1QtKG.png

As far as I can tell, the problem originates with pytesseract.image_to_boxes(), which recognizes each character correctly but somehow doesn't recognize its position and size correctly. Is there something I can do to make image_to_boxes() more accurate, or is there a different solution entirely?
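One way to rule out the coordinate conversion as the culprit is to isolate it in a small function. This is a minimal sketch, and the name box_line_to_slices is hypothetical. Tesseract box lines have the form "char left bottom right top page", with y measured from the bottom of the image, so the rows have to be flipped before slicing a NumPy array. As far as I can tell this is the same arithmetic as the question's loop, which suggests the boxes themselves, not the slicing, are what is off.

```python
def box_line_to_slices(line, image_height):
    """Turn one line of pytesseract.image_to_boxes() output into NumPy slices.

    Box lines look like "<char> <left> <bottom> <right> <top> <page>". Tesseract
    measures y from the BOTTOM of the image, while NumPy/OpenCV rows are counted
    from the TOP, so the y range is flipped using the image height.
    """
    # Split from the right so the character field is kept whole.
    char, left, bottom, right, top, _page = line.rsplit(" ", 5)
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)
    rows = slice(image_height - top, image_height - bottom)
    cols = slice(left, right)
    return char, rows, cols
```

In the question's loop this would be used as char, rows, cols = box_line_to_slices(b, hImg) followed by roi = img[rows, cols].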

stateMachine · Accepted Answer · 2022-09-28 00:47:44Z

Here's a purely OpenCV-based solution. You can produce bounding rectangles enclosing each character; the tricky part is segmenting each character cleanly. Image resolution is crucial for this: your image is quite small, and at that DPI some characters appear to be joined. That's the issue you are facing. Adaptive thresholding seems to somewhat alleviate the issue, but again, resolution is crucial and you would benefit from higher-resolution images.

These are the steps:

Read the input image and scale it up 2x, because, again, it is pretty small
Convert to grayscale
Apply adaptive thresholding. You can tune two parameters to further improve the results: the Window Size and the Bias Constant per Window.
Get the external contours on the binary mask
Get the bounding box of each character
Apply an aspect ratio and area filter to ignore noise
Crop the filtered characters using NumPy slicing
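Because the first step scales the image up 2x, every bounding box found later is in the scaled coordinate system, and the crops come from the enlarged image. If you would rather cut characters out of the original image, each box has to be divided by the scale factor. A minimal sketch, assuming boxes in the (x, y, w, h) form returned by cv2.boundingRect; unscale_box is a hypothetical helper, and since cv2.resize interpolates, the mapping is only accurate to about a pixel.

```python
import math


def unscale_box(box, scale_factor):
    """Map an (x, y, w, h) box found on an upscaled image back to the original.

    The top-left corner is rounded down and the bottom-right corner rounded up,
    so the mapped box always covers the whole character (it may grow by a pixel).
    """
    x, y, w, h = box
    x0 = math.floor(x / scale_factor)
    y0 = math.floor(y / scale_factor)
    x1 = math.ceil((x + w) / scale_factor)
    y1 = math.ceil((y + h) / scale_factor)
    return x0, y0, x1 - x0, y1 - y0
```

With scaleFactor = 2, a box at (10, 20) of size 30x40 on the scaled image maps to (5, 10) of size 15x20 on the original.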

Let's see the code:

# Imports:
import numpy as np
import cv2

# Image path
path = "D://opencvImages//"
fileName = "fYYlu.png"

# Reading an image in default mode:
inputImage = cv2.imread(path + fileName)

# Scale image:
scaleFactor = 2
inputImage = cv2.resize(inputImage, None, fx=scaleFactor, fy=scaleFactor, interpolation=cv2.INTER_LINEAR)

# Deep Copy:
inputImageCopy = inputImage.copy()

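# Convert BGR to grayscale:
grayscaleImage = cv2.cvtColor(inputImage, cv2.COLOR_BGR2GRAY)

# Adapt. threshold. Each pixel is compared with the Gaussian-weighted mean of its
# windowSize x windowSize neighbourhood minus constantValue (windowSize must be odd).
# THRESH_BINARY_INV turns the dark text white and the background black:
windowSize = 41
constantValue = 8
binaryImage = cv2.adaptiveThreshold(grayscaleImage, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV,
                                    windowSize, constantValue)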

So far I've managed to get this binary mask:

Which is pretty good. There is some noise and some characters are apparently joined, but let's work with this. Next, let's detect contours and apply a blob filter to ignore noise. The noise blobs seem to be either very small or very large, so let's set a lower and an upper area threshold to discard them. Additionally, the characters seem to be almost square, in the sense that their width/height ratio is pretty close to 1.0:

# Find the EXTERNAL contours on the binary image (OpenCV 4.x returns contours, hierarchy):
contours, _ = cv2.findContours(binaryImage, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Look for the outer bounding boxes (no children):
for c in contours:

    # Get the bounding rectangle and its dimensions:
    rectX, rectY, rectWidth, rectHeight = cv2.boundingRect(c)

    # Compute the bounding box area (used as a cheap stand-in for the contour area):
    contourArea = rectHeight * rectWidth

    # Compute aspect ratio:
    referenceRatio = 1.0
    contourRatio = rectWidth / rectHeight
    epsilon = 1.1
    ratioDifference = abs(referenceRatio - contourRatio)
    print((ratioDifference, contourArea))

    # Red color, filtered blobs:
    color = (0, 0, 255)

    # Apply contour filter:
    if ratioDifference <= epsilon:  # Aspect Ratio
        minArea = 50 * scaleFactor
        maxArea = 120 * minArea

        if minArea <= contourArea < maxArea:   # Area Filter
            # Crop contour:
            croppedChar = inputImage[rectY:rectY + rectHeight, rectX:rectX + rectWidth]
            cv2.imshow("Cropped Character", croppedChar)
            cv2.waitKey(0)

            # Green color, detected blobs:
            color = (0, 255, 0)

    # Draw the rectangle on the copy of the input image:
    cv2.rectangle(inputImageCopy, (rectX, rectY),
                  (rectX + rectWidth, rectY + rectHeight), color, 2)

    # (Optional) Show image:
    cv2.imshow("Bounding Rectangle", inputImageCopy)
    cv2.waitKey(0)
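The filter in the loop can be pulled out into a small pure function, which makes it easier to experiment with the thresholds without re-running the whole pipeline. This is a sketch that re-implements the same checks; the function name and keyword defaults are mine. One thing it makes visible: for boxes taller than they are wide the ratio is below 1, so the difference is at most 1.0, which means that with epsilon = 1.1 the aspect-ratio test as written only rejects boxes more than 2.1 times wider than they are tall.

```python
def passes_blob_filter(rect_width, rect_height, scale_factor=2,
                       reference_ratio=1.0, epsilon=1.1):
    """Aspect-ratio and area filter equivalent to the checks in the loop above."""
    if rect_height == 0:
        return False
    # Aspect ratio: how far width/height is from the reference ratio.
    ratio_difference = abs(reference_ratio - rect_width / rect_height)
    if ratio_difference > epsilon:
        return False
    # Area limits scale with the resize factor, as in the answer.
    min_area = 50 * scale_factor
    max_area = 120 * min_area
    box_area = rect_width * rect_height
    return min_area <= box_area < max_area
```

For example, with the default scale factor of 2 a 10x10 box passes (area 100 equals the minimum), a 5x5 box is rejected as noise, and a tall, thin 5x40 box still passes.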

This image shows in green the "valid" character boxes, and in red the filtered ones:

You can fine-tune the results by fiddling with the adaptive threshold parameters. This GIF shows the results for ascending, odd Window Size (WS) values. The higher the window size, the better, up to a certain point at which characters start joining into bigger clusters:
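To get a feel for what the two parameters do, here is a simplified NumPy re-implementation of adaptive thresholding. It uses a plain box mean (what cv2.ADAPTIVE_THRESH_MEAN_C does) rather than the Gaussian-weighted mean used in the answer, and ignores OpenCV's exact rounding, so treat it as an illustration rather than a drop-in replacement. Each pixel is compared with the mean of its window_size x window_size neighbourhood minus constant: the window has to be large enough to contain some background around each stroke, and the constant stops flat, uniform regions from being classified as ink.

```python
import numpy as np


def adaptive_mean_threshold_inv(gray, window_size, constant):
    """Simplified stand-in for cv2.adaptiveThreshold with ADAPTIVE_THRESH_MEAN_C and
    THRESH_BINARY_INV: a pixel becomes 255 (ink) when it is at most
    (mean of its window_size x window_size neighbourhood) - constant, else 0.
    """
    if window_size < 3 or window_size % 2 == 0:
        raise ValueError("window_size must be an odd number >= 3")
    k = window_size
    pad = k // 2
    img = np.asarray(gray, dtype=np.float64)
    h, w = img.shape
    # Replicate edge pixels so windows at the border are full-sized.
    padded = np.pad(img, pad, mode="edge")
    # Summed-area table with a leading row/column of zeros:
    # sat[i, j] == padded[:i, :j].sum()
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    # Sum of the k x k window centred on each original pixel.
    window_sum = sat[k:k + h, k:k + w] - sat[:h, k:k + w] - sat[k:k + h, :w] + sat[:h, :w]
    local_mean = window_sum / (k * k)
    return np.where(img <= local_mean - constant, 255, 0).astype(np.uint8)
```

Calling it as adaptive_mean_threshold_inv(grayscaleImage, 41, 8) mirrors the answer's parameters; because the answer's Gaussian weighting favours pixels near the centre of each window, the two masks will not be pixel-identical.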

If the characters are too close together, it might be possible to use erode to make them more distinct.
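To illustrate the erosion suggestion: in the answer's mask the characters are white on black (because of THRESH_BINARY_INV), so eroding it shrinks every character and can break thin bridges between touching glyphs. Below is a small NumPy sketch of a 3x3 erosion, meant to mirror cv2.erode(mask, np.ones((3, 3), np.uint8)), plus a flood-fill blob counter used only to show the effect on a toy mask (both helper names are hypothetical). Erosion also thins real strokes, and thin characters may vanish entirely, so the kernel size and number of iterations need tuning on the real image.

```python
from collections import deque

import numpy as np


def erode3x3(mask):
    """Binary erosion with a 3x3 square kernel; returns a boolean mask.

    A pixel stays set only if it and all 8 neighbours are set. The border is
    padded with True so the image edge by itself does not erode anything.
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=True)
    out = np.ones((h, w), dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def count_blobs(mask):
    """Count 8-connected groups of set pixels using a breadth-first flood fill."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    blobs = 0
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            blobs += 1
            seen[y, x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return blobs
```

On a toy mask with two 5x3 "characters" joined by a one-pixel bridge, count_blobs reports one blob before erosion and two after.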
I should probably have mentioned that I lowered the resolution of my input image in order to upload it, it's actually 950x1558. I expected it to be embedded in the post instead of displayed as a link, so I didn't want it to be too big. Anyways, I just tried your solution and it worked for isolating the characters, but now the image files aren't named after the character they show because I'm not doing any OCR. I'll try to work that out on my own. Thanks for the help!
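For naming each crop after its character, one option (not part of the accepted answer) is to OCR each crop on its own with Tesseract's single-character page segmentation mode, --psm 10, passed through the config argument of pytesseract.image_to_string. Single-character OCR on small crops can be unreliable, so treat this as a sketch. Some characters in this image, such as "?" and "*", are not allowed in Windows file names, so the hypothetical crop_filename helper below keeps plain letters and digits and writes everything else as a Unicode code point.

```python
def crop_filename(text, index, ext="png"):
    """Build a file name for a character crop that is also safe on Windows.

    ASCII letters and digits are kept as-is; anything else (for example '?',
    '*' or '®') is written as its Unicode code point. The zero-padded index
    keeps names unique even on case-insensitive file systems.
    """
    if not text:
        label = "unknown"
    elif text.isascii() and text.isalnum():
        label = text
    else:
        label = "-".join(f"U+{ord(ch):04X}" for ch in text)
    return f"{index:04d}_{label}.{ext}"


def ocr_single_character(crop):
    """OCR one cropped character with Tesseract's single-character mode (--psm 10)."""
    import pytesseract  # imported here so crop_filename() works without Tesseract installed
    return pytesseract.image_to_string(crop, config="--psm 10").strip()
```

Inside the answer's loop this could be used as cv2.imwrite("charimages/" + crop_filename(ocr_single_character(croppedChar), n), croppedChar), where n is a running counter you maintain yourself.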
