1,674 questions
3
votes
2
answers
108
views
How do I Download Poppler and Tesseract Programmatically with PowerShell
In Python, there are two libraries which are often used in tandem, Poppler and Tesseract. They both need external downloads to function:
Poppler, Tesseract. The general recommendation for Windows is ...
-4
votes
1
answer
138
views
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 270: invalid start byte - Why? [closed]
I'm doing an ultra-simple web page scraper using Python/Beautifulsoup.
Facing a key information displayed as PNG image, I've had to reach for PIL/Pytesseract.
Code being extremely simple, and working ...
0
votes
0
answers
39
views
Error: Deserialize header failed: 1.lstmf when training new data
I need to train the default eng data so that it can also recognize some new characters. I created box files and lstm files, and when running the command:
lstmtraining \
--model_output output/eng_latin \
--...
0
votes
2
answers
74
views
Pytesseract cannot always understand very simple and clear text (font Consolas)
Pytesseract cannot understand very simple and clear text. I've tried nearest neighbor, bilinear, gaussian blur, and everything else and cannot get tesseract to read the text consistently, the best I ...
1
vote
0
answers
184
views
How to set Tesseract PSM in Docling (Python)
I’m using Docling to OCR scanned PDFs. I want to control Tesseract’s page-segmentation mode (PSM), e.g. --psm 6.
Docling exposes both TesseractOcrOptions and TesseractCliOcrOptions, but neither ...
2
votes
1
answer
70
views
Tesseract unable to recognise the letter O in plain image
I'm attempting to perform OCR on a set of single letters inside an image using Python. I'm new to this so apologies if I get the terminology wrong, but I've filtered and have obtained (I think) quite ...
1
vote
2
answers
246
views
Why do I get nothing in output with pytesseract?
I have installed language support for chi_sim:
ls /usr/share/tesseract-ocr/5/tessdata
chi_sim.traineddata eng.traineddata pdf.ttf
configs osd.traineddata tessconfigs
You can try it by ...
1
vote
1
answer
83
views
When I Try To Train a Tesseract Model I get a Compute CTC targets failed error
I am currently using tesseract 5.0 and am training a model. I have generated the png, box and the ground truth files for a thousand images. However, when I run the command:
make training MODEL_NAME=...
0
votes
1
answer
167
views
How to get good OCR results using pytesseract
I'm trying to get the data out of this image:
and no matter what I try I can't get a good result.
I have tried ImageEnhance and cv2
I got the most promising result using cv2 and adaptive threshold:
...
1
vote
1
answer
81
views
Tesseract doesn't find page numbers
I have a PDF document that I want to scan with pytesseract, but the page numbers are not recognized on any of the pages. The PDF is written with LaTeX. I tried ...
0
votes
1
answer
62
views
Prevent tesseract guessing characters based on surrounding context instead of just the character outline
I'm using pytesseract to read tabular data out of an image but I'm having trouble with the software making "educated guesses" about characters and word splitting based on context.
I have a ...
0
votes
0
answers
62
views
lstm-unicharset file is unable to be created during tesseract training
I am trying to fine-tune an Optical Character Recognition (OCR) model for Japanese using Tesseract's tesstrain repository. I tried encoding the bash commands into Python in VSCode as I wanted ...
0
votes
0
answers
147
views
Tesseract OCR Command in ocrmypdf Fails with 'SubprocessOutputError' on Windows
ExitCodeException _common.py:271
Traceback (most recent call last):
File "C:\<USER>\apps\python\...
1
vote
1
answer
159
views
Tesseract Training: Error 'Integer (fast) model' When Using Apex.lstm
I’ve been following this tutorial from YouTube:
Guide to Tesseract Training
https://www.youtube.com/watch?v=KE4xEzFGSU8&t=13s
and its corresponding GitHub repository: astutejoe/tesseract_tutorial.
...
-1
votes
1
answer
68
views
I'm having trouble trying to convert image to text in python
I'm trying to convert the attached image using the pytesseract and opencv libraries in python, but the conversion is not satisfactory, since many characters are converted incorrectly. Does anyone have ...
-1
votes
1
answer
58
views
Pytesseract not recognize text from image in Python
I am working with a Django application in which I need to solve a captcha. I am already saving a temporary captcha file, but when I try to read the captcha using pytesseract it returns nothing ...
2
votes
1
answer
531
views
Image Preprocessing to extract 2D number list
I've been trying to make a puzzle-solving program. The game is 'fruit box' and you can play it through the link below.
https://en.gamesaien.com/game/fruit_box/
To do that, I have to extract numbers ...
3
votes
0
answers
107
views
Memory Usage Keeps Increasing in Python Script Using OpenCV, PyAutoGUI, and Tesseract OCR [closed]
I'm working on a Python script that continuously monitors a screen region, extracts text using Tesseract OCR, and sends serial commands to an Arduino based on the detected text. However, I notice that ...
0
votes
1
answer
40
views
Pytesseract numbers image to text
I am trying to use pytesseract to extract numbers from images.
It works for some of them (1, 2, 3, 5, 6, 20...) but I would like to make it work for all of them.
Here is a sample of the data that I'm ...
0
votes
0
answers
72
views
PyTesseract and 7 segment numbers, how to get confidence of recognition?
I need to recognize digits on 7-segment clocks (see picture below), so I use the following Python code:
def detect_date(image: cv2.UMat, bbox:list) -> datetime:
gry1 = cv2.cvtColor(image, ...
0
votes
0
answers
51
views
Extracting data from a table with known labels with tesseract
I am trying to use Tesseract to create a small Windows application that allows the user to:
Take a screenshot of the monitor and cut a smaller portion containing a table (the table always has the ...
0
votes
0
answers
24
views
tesseract lost the language pack
C:\Users\xwmsu>tesseract --list-langs
Error opening data file \app\Tesseract-OCR\tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of ...
0
votes
0
answers
44
views
Pytesseract splitting a line
I'm new to using Pytesseract, and I'm having trouble recognizing an image:
Bet Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\Tesseract.exe'
# ...
0
votes
0
answers
37
views
How can extract the content from image using python with the pytesseract?
I tried to extract the content from an image with the Python py-tesseract OCR, but I was unable to obtain the numbers. The extracted_text value comes back empty.
Code:
def ImageReader(image_path):
...
-1
votes
1
answer
125
views
Pytesseract wrong text recognition when word are close to each other
When I use PyTesseract to recognize the text in this image, it returns 'FORREST C. BLopGetTrT' instead of 'FORREST C. BLODGETT'. The result of the code I get
the image I use, which contains many names.
I ...
1
vote
1
answer
239
views
Pytesseract TesseractError: Unable to Load Language Files
I am trying to use pytesseract in my system. But I am getting the following error message
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/share/eng.traineddata ...
2
votes
0
answers
111
views
tesseract not able to find .lstm-unicharset file while performing model training
I am using tesseract to perform custom model training. I have created my own text dataset and saved it in the tesstrain->data->codec folder with images and corresponding .gt files. At the same level as ...
2
votes
1
answer
195
views
OCR character recognition fails
I am experimenting with AI and specifically character recognition. I saw that one of the best algorithms is OCR and Google's implementation in Tesseract seems like the best open source solution right ...
0
votes
2
answers
177
views
How do i get around this permission error with tesseract-ocr
I am doing a Python project in which I use Tesseract-OCR. When I set it up from git, it gave me this error:
C:\Users\jpmv1\AppData\Local\Programs\Python\Python312\python.exe C:\Users\jpmv1\Projects\...
1
vote
1
answer
99
views
TesseractNotFoundError is displayed after deploying app on Render even after trying several methods [closed]
I am trying to deploy an app through Render, but after executing there is an error:
TesseractNotFound or Tesseract is not installed
Even though I have added package.txt, requirements.txt as well as build....
0
votes
0
answers
94
views
How to extract data from pdfs which are not in tables or containers into a column based table format in python?
I am trying to convert my PDF data into a structured table format. I have tried a bunch of options, but none of them have been able to separate fields into table columns. I am able to do ...
0
votes
1
answer
134
views
Trying to convert a PDF to a JPEG but I keep facing an error
I'm trying to convert a PDF into a JPEG using Python. I'm trying to perform OCR by converting the PDFs into JPEGs but keep running into the error:
cannot identify image file <_io.BytesIO object at ...
2
votes
1
answer
292
views
what is the best way to recognize embossed text with Tesseract OCR?
I am trying to read the text from a U.S. penny to orient the coin.
the original is from
https://www.usmint.gov/wordpress/wp-content/uploads/2024/05/2024-lincoln-penny-uncirculated-obverse-philadelphia....
-1
votes
1
answer
105
views
Use pytesseract OCR to read text from a captcha
I need to use Pytesseract to extract text from this picture:
I'm using this code:
import pytesseract
import cv2
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract....
0
votes
0
answers
120
views
Unable to Extract Text from Image Using Tesseract OCR - How to Preprocess Instagram Reels Frames
I am working on a project where I need to extract text from frames of an Instagram Reels video. I used the yt-dlp to download the video, extracted frames using ffmpeg, and attempted to read the text ...
0
votes
1
answer
271
views
Problem to extract correct data from PDF with tesseract
I'm trying to extract specific data from multiple PDFs. I begin by isolating the example image (Picture 1) using horizontal and vertical lines to create cells. After creating the cells, I crop them ...
0
votes
0
answers
47
views
PyTesseract not extracting text?
Pytesseract does not extract the text from the image. The terminal stays black with a space as if it was actually trying to extract the text.
Here is my code and the image:
from PIL import Image
...
0
votes
0
answers
262
views
How can I extract tables from an image into excel using optical character recognition?
As an example, I have this image and would like to convert it to a modifiable Excel table. I have tried using the 'pytesseract' library, but it doesn't accurately extract the text from the image ...
0
votes
1
answer
155
views
How to recognize single characters from an image using Tesseract?
This is the original image:
This is the processed image:
I'm trying to automate a mini-game in which characters appear on the screen. I did some light research and managed to process the image to ...
0
votes
0
answers
63
views
OpenCV contours sorting x-axis and y-axis
I am working on a python program to solve a wordsearch. I am using pytesseract and opencv to process an image of the wordsearch and the solution will be displayed as a text. The script processes the ...
0
votes
1
answer
95
views
Getting numbers from matrix image using pytesseract
I am trying to retrieve the text from an image of a 4x4 matrix. The text consists of numbers. Although I was expecting the numbers, all I got was: BE, 8, EEE, BE. The image is attached here: image
Anyone ...
1
vote
1
answer
148
views
Pytesseract OCR recognizes "o" as "0"
I'm trying to read text on this image using pytesseract library.
original-screenshot.png
Here is my code:
path = 'original-screenshot.png'
image = cv2.imread(path)
image = cv2.cvtColor(image, cv2....
1
vote
0
answers
50
views
I don't want the boxes to be read as special character or letters
This is the image:
This is the sample image that i will convert into text.
And here is the output:
***"|
| .**
indicators (Bids:
S.1.4.1. valid Certificate of Registration and **LJ Poy |**
...
3
votes
1
answer
115
views
Incorrect digit detection using Tesseract OCR on video frames in Python
I'm trying to calculate the real time of video recording. I have a lot of videos, some of which were lost during transmission. All of them are in mp4 format. To get the duration, I recognize the time ...
-1
votes
1
answer
125
views
Unable to solve the captcha correctly using pytesseract
I have created a python code to read the captcha using OCR and fill the form further. I have used pytesseract library for the recognition of characters in the captcha. I am unable to retrieve the ...
1
vote
0
answers
130
views
Improving OCR accuracy with pytesseract for processing manga images
def get_string(img_path):
    img = cv2.imread(img_path)
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
...
0
votes
1
answer
99
views
OCR and pytesseract detecting numbers in an image
currentbid.png:
I am trying to detect the number in this image and it gives me letters or the wrong number.
This is the image in which I am trying to detect the number. I've tried tons of stuff with greyscale ...
0
votes
0
answers
236
views
How to read small numbers on given image using PyTesseract
I am trying to use OpenCV and Pytesseract to loop over the white numbers at the bottom of this image (or similar images) and record each number.
While I have the logic correct for determining the ROI,...
1
vote
0
answers
27
views
I want a more detailed square using pytesseract
I want to make a code to extract the x-axis numbers and x-axis labels in the chart. I hope the numbers and labels are separated. Is there a way to solve it?
Recognize the x-axis y-axis and classify it ...
0
votes
2
answers
257
views
Appium: identifying iOS elements using pytesseract instead of locators
Below is a snapshot of our application in test. iOS app in react native. The hierarchy is too deep.
We are already using snapshotmaxdepth - 60 as one of the capabilities.
Other capabilities include ...