I'm trying to use tesseract and opencv in Python to extract every character from an image and save each character to an individual image file. My code has no problem recognizing the text properly and printing it out, but it's not recognizing the position and size of the individual characters properly. Here's the input image:
https://i.sstatic.net/fYYlu.png
Here's my code:
#=Imports======================================================================
import cv2
import sys
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\User\AppData\Local\Tesseract-OCR\tesseract.exe'
import math
from PIL import ImageGrab
#=Main=Code====================================================================
#Read in image
img = cv2.imread("feldman.png")
#Processing to make the image suitable for OCR
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) #Convert image to greyscale
img = cv2.threshold(img, 190, 255, cv2.THRESH_BINARY)[1] #Apply threshold effect
#Perform OCR and print to command line
print("Output from image_to_string():")
print(pytesseract.image_to_string(img))
#Save each character as an image
print("")
print("First character of each line from the output of image_to_boxes():")
hImg, wImg = img.shape #Get the dimensions of the image
boxes = pytesseract.image_to_boxes(img) #Returns a string with one line per character box. Each line has the form: character x1 y1 x2 y2 page (the last field is the page number, which is 0 for a single image), for example: s 596 164 609 181 0
ROI_number=0 #ROI = "region of interest", it's basically just the index for which character we're on
for b in boxes.splitlines(): #For every line in the string created by image_to_boxes()...
    b = b.split(' ') #Split the line into its fields: b[0] is the character, b[1] is x1, b[2] is y1, b[3] is x2, b[4] is y2, and b[5] is the page number (0 here)
    char, x1, y1, x2, y2 = b[0], int(b[1]), int(b[2]), int(b[3]), int(b[4]) #Box corners, measured from the bottom-left corner of the image
    print(char, end="") #Print out each character recognized by image_to_boxes()
    top, bottom = hImg-y2, hImg-y1 #Flip the y coordinates, because image rows are counted from the top
    roi = img[top:bottom, x1:x2] #Crop rows top..bottom and columns x1..x2
    cv2.imwrite("charimages/"+str(ROI_number)+".jpeg", roi) #Save an image file for the character
    ROI_number += 1
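To make the coordinate handling in the loop explicit, here is a minimal standalone sketch of the conversion from one image_to_boxes() line to NumPy slices. The helper name `box_line_to_slices` and the image height of 1000 pixels are assumptions made up for illustration; the box format (character x1 y1 x2 y2 page, with the origin at the bottom-left corner) is the one described in the comments above.

```python
def box_line_to_slices(line, image_height):
    """Parse one image_to_boxes() line into (char, row_slice, col_slice).

    Hypothetical helper, not part of pytesseract. Assumes the line has the
    form "char x1 y1 x2 y2 page" with the origin at the bottom-left corner.
    """
    # Split from the right so the five numeric fields are always found
    char, x1, y1, x2, y2, _page = line.rsplit(" ", 5)
    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
    # Image rows are counted from the top, so flip the y coordinates
    top, bottom = image_height - y2, image_height - y1
    return char, slice(top, bottom), slice(x1, x2)


# Example line from the comments above, with an assumed image height of 1000
char, rows, cols = box_line_to_slices("s 596 164 609 181 0", image_height=1000)
print(char, rows, cols)  # s slice(819, 836, None) slice(596, 609, None)
```

With a real image, the crop would then be `img[rows, cols]`, which is the same region the loop above cuts out.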
Here is the output my code prints to the command line (which is almost perfectly correct):
Output from image_to_string():
FPT ISBN 0-688-05913-4 >$22.95
IMPONDERABLES
The Solution to the
Mysteries of Everyday Life
David Feldman
Illustrated by Kas Schwan
Did you ever wonder why you never
see baby pigeons? Or why a thumbs-up
gesture means “OK”? At last the solu-
tions to some of life’s most baffling
questions are gathered here in one
volume. Written in an informative
and entertaining style and illustrated
with drawings that are clearly to the
point, Imponderables gets to the bottom
of everyday life’s mysteries, among
them:
* Why is a mile 5,280 feet?
* Which fruits are in Juicy Fruit*®
gum?
* Why does an X stand for a kiss?
* Why don’t cats like to swim?
* Why do other people hear our
voices differently than we do?
Dictionaries, encyclopedias, and
almanacs don’t have the answers—
Imponderables does! And in answering
such questions, it touches on an aston-
ishing variety of subjects, including
(continued on back flap)
First character of each line from the output of image_to_boxes():
FPTISBN0-688-05913-4>$22.95IMPONDERABLESTheSolutiontotheMysteriesofEverydayLifeDavidFeldmanIllustratedbyKasSchwan~Didyoueverwonderwhyyouneverseebabypigeons?Orwhyathumbs-upgesturemeans“OK”?Atlastthesolu-tionstosomeoflife’smostbafflingquestionsaregatheredhereinonevolume.Writteninaninformativeandentertainingstyleandillustratedwithdrawingsthatareclearlytothepoint,Imponderablesgetstothebottomofeverydaylife’smysteries,amongthem:*Whyisamile5,280feet?*WhichfruitsareinJuicyFruit*®gum?*WhydoesanXstandforakiss?*Whydon’tcatsliketoswim?*Whydootherpeoplehearourvoicesdifferentlythanwedo?Dictionaries,encyclopedias,andalmanacsdon’thavetheanswers—Imponderablesdoes!Andinansweringsuchquestions,ittouchesonanaston-ishingvarietyofsubjects,including(continuedonbackflap)~
But when it comes to the output image files, a lot of them are wrong. Some of the images are correct, but a lot of them are just... messed up. Take the image files corresponding to the word "IMPONDERABLES" as an example. There are 13 files, 1 for each character, which makes perfect sense. However, some of the images contain multiple characters:
https://i.sstatic.net/1QtKG.png
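One way to find out which files are affected without opening each one is to flag boxes that are much wider than the typical box. The sketch below only illustrates the idea: `find_wide_boxes`, the `factor` threshold of 1.8, and the sample data are all made up for the example, and it uses nothing beyond the box format shown earlier.

```python
from statistics import median


def find_wide_boxes(boxes_text, factor=1.8):
    """Return (index, char, width) for boxes much wider than the median box.

    Hypothetical diagnostic; the factor of 1.8 is an arbitrary guess.
    """
    parsed = []
    for index, line in enumerate(boxes_text.splitlines()):
        fields = line.rsplit(" ", 5)
        if len(fields) != 6:
            continue  # Skip blank or malformed lines
        char, x1, x2 = fields[0], int(fields[1]), int(fields[3])
        parsed.append((index, char, x2 - x1))
    if not parsed:
        return []
    typical = median(width for _, _, width in parsed)
    return [item for item in parsed if item[2] > factor * typical]


# Made-up sample in the image_to_boxes() format, not real Tesseract output
sample = "I 10 50 16 70 0\nM 18 50 58 70 0\nP 40 50 52 70 0\nO 54 50 68 70 0"
for index, char, width in find_wide_boxes(sample):
    print(index, char, width)  # 1 M 40
```

The index lines up with ROI_number in my loop, so it points straight at the suspicious file. Larger fonts and naturally wide letters will also be flagged, so this is a rough filter rather than a reliable test.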
As far as I can tell, the problem originates with pytesseract.image_to_boxes(), which recognizes each character correctly but doesn't report its position and size correctly. Is there something I can do to make image_to_boxes() more accurate, or is there a different solution entirely?