
I’m working on a project where I have to detect objects in a PDF document. After detecting the objects, I need to read the text at each object's location, since that text will be used as the object's name.

Example: [image of a detected object and the text to be read]

I’ve managed to detect the objects. I use OpenCV to preprocess the image and want to use Tesseract to read the text from it.

I’ve used a high-resolution image in order to improve Tesseract's accuracy.
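For reference, Tesseract is commonly reported to work best when capital letters end up roughly 30–40 px tall, so rather than one fixed resolution it can help to compute an upscale factor per crop. A minimal sketch (the helper name and the 30–40 px target are my assumptions, not from my code; the measured height would come from your own contour detection):

```cpp
#include <algorithm>
#include <cmath>

// Sketch: compute how much to upscale a crop so its glyphs land near the
// capital-letter height Tesseract tends to handle best (~30-40 px).
// `measuredCapHeightPx` is assumed to be the bounding-box height of a
// detected character contour from the preprocessing step.
double scaleForTesseract(double measuredCapHeightPx, double targetPx = 35.0) {
    if (measuredCapHeightPx <= 0.0) return 1.0;   // nothing measurable
    double s = targetPx / measuredCapHeightPx;
    return std::clamp(s, 1.0, 8.0);               // avoid absurd blow-ups
}
```

The result would then be fed to something like `cv::resize` with `cv::INTER_CUBIC` before binarization.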

I’ve tried using a whitelist, a word list and a pattern file to further improve Tesseract's accuracy. I’ve also been experimenting with different page segmentation modes, such as PSM_SINGLE_WORD and PSM_SINGLE_BLOCK.
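For completeness, this is the shape of the two auxiliary files I am using. The word list is plain UTF-8 with one word per line; for the patterns file I am relying on Tesseract's trie pattern syntax, where (if I understand the docs correctly — please verify against your Tesseract version) `\d` matches any digit and `\*` means "zero or more of the previous element":

```text
# wordList.txt - one word per line
T2
T3
TAR3

# patterns.txt - one pattern per line (syntax is an assumption, see above)
T\d\d\*
TA\d\d\*
TAR\d\d\*
```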

Sometimes Tesseract reads the text correctly, e.g. the first image returns "T2,T3\n" using PSM_SINGLE_WORD (but not using PSM_SINGLE_BLOCK, which returns "12,13\n"). In most cases, however, it doesn’t return the correct text.

Preprocessed images for reference: [seven preprocessed crops, corresponding to the 1st–7th results below]

1st:
Word:  "T2,T3\n"
Block:  "12,13\n"
Expected: "T2,T3\n"

2nd:
Word:  "T2,T3,TAR3\n"
Block:  "12,13, 1T AR3\n"
Expected: "T2,T3,TAR3\n"

3rd:
Word:  "TA8\n"
Block:  "TAR8\n"
Expected: "TAR8\n"

4th:
Word:  "T2TT\n"
Block:  "12,13,14,\nTAR35TAR4\n"
Expected: "T2,T3,T4,\nTAR3,TAR4"

5th:
Word:  "TTT2AA,RRT333A,,\n"
Block:  "12,13,\nTAR35\nTA34\n"
Expected: "T2,T3,\nTAR3,\nTAR34\n"

6th:
Word:  "T15\n"
Block:  "TAR15\n"
Expected: "TAR15\n"

7th:
Word:  "T\n"
Block:  "111\n"
Expected: "T11\n"

As you can see, sometimes PSM_SINGLE_WORD returns better results, sometimes PSM_SINGLE_BLOCK does and sometimes neither returns the correct result.

Since I have quite a few different variations in the images, and I don’t understand why some characters are detected incorrectly (e.g. "," as "5" in the 4th image), I’m looking for assistance in resolving this problem.
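Since neither mode wins consistently, one workaround I am considering is to validate both outputs against the label grammar the images seem to follow (comma-separated tokens like T2, TA8, TAR35) and prefer whichever pass matches it. This is only a sketch under that assumed grammar, not part of my current code:

```cpp
#include <regex>
#include <string>

// Sketch (assumed label grammar): every newline-separated line should be a
// comma-separated list of tokens such as "T2", "TA8" or "TAR35".
bool looksLikeValidLabel(const std::string& text) {
    static const std::regex line(
        R"(T(A|AR)?\d+(\s*,\s*T(A|AR)?\d+)*\s*,?)");
    size_t start = 0;
    bool sawToken = false;
    while (start <= text.size()) {
        size_t end = text.find('\n', start);
        if (end == std::string::npos) end = text.size();
        std::string part = text.substr(start, end - start);
        if (!part.empty()) {
            if (!std::regex_match(part, line)) return false;
            sawToken = true;
        }
        if (end == text.size()) break;
        start = end + 1;
    }
    return sawToken;
}

// Prefer whichever OCR pass produced grammatically valid text.
std::string pickResult(const std::string& word, const std::string& block) {
    bool w = looksLikeValidLabel(word), b = looksLikeValidLabel(block);
    if (w && !b) return word;
    if (b && !w) return block;
    return word.empty() ? block : word;  // tie: fall back to the word pass
}
```

On the 1st image this would pick the PSM_SINGLE_WORD result ("T2,T3\n") over "12,13\n", since the latter doesn't start with T; it can't help when both passes are wrong, as in the 7th image.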

The relevant code snippet is the following:

// Copy the preprocessed OpenCV image into a Leptonica Pix, pixel by pixel.
// (Slow for large images; copying whole rows would be faster, but it works.)
Pix* pixImage = pixCreate(eroded.cols, eroded.rows, 8);
for (int y = 0; y < eroded.rows; y++) {
    for (int x = 0; x < eroded.cols; x++) {
        pixSetPixel(pixImage, x, y, eroded.at<uchar>(y, x));
    }
}

QString dataDir = qApp->applicationDirPath() + QStringLiteral("/tessdata");
QString d = QDir::toNativeSeparators(dataDir);

QString result;

// One recognition pass with the given page segmentation mode. A fresh
// TessBaseAPI per pass mirrors the original Init()/End() cycle.
auto runPass = [&](tesseract::PageSegMode psm, const char* label) -> QString {
    tesseract::TessBaseAPI tess;
    if (tess.Init(d.toLatin1().constData(), "eng", tesseract::OEM_DEFAULT) != 0)
        return QString();
    tess.SetPageSegMode(psm);
    // Note: the character whitelist is only honored by the legacy engine;
    // the LSTM engine (what OEM_DEFAULT selects in Tesseract 4+) ignores it.
    tess.SetVariable("tessedit_char_whitelist", "TAR0123456789, ");
    // Note: in Tesseract 4+ the dictionary-related parameters below are
    // init-only, so setting them after Init() may have no effect; they would
    // need to be passed via a config file or an extended Init() overload.
    tess.SetVariable("user_words_file", "wordList.txt");
    tess.SetVariable("user_patterns_file", "patterns.txt");
    tess.SetVariable("load_system_dawg", "0");
    tess.SetVariable("load_freq_dawg", "0");
    tess.SetVariable("wordrec_enable_assoc", "0");
    tess.SetVariable("use_only_my_words", "1");
    tess.SetImage(pixImage);

    char* text = tess.GetUTF8Text();
    QString passResult = QString::fromUtf8(text);
    delete[] text;  // GetUTF8Text() transfers ownership; this used to leak
    qDebug() << label << passResult;
    tess.End();
    return passResult;
};

result += runPass(tesseract::PSM_SINGLE_WORD, "Word:");   // First pass
result += runPass(tesseract::PSM_SINGLE_BLOCK, "Block:"); // Second pass

pixDestroy(&pixImage);

Since this is only my second question asked on Stack Overflow I might be missing some information, so please feel free to ask for anything you might require to help me.

  • One thing that would help is if you managed to resolve the dependency on C++, or at least Qt, and provided a minimal reproducible example. I guess you could reproduce this with just the tesseract command-line application, too. Also, concerning the image above: is that what you feed to Tesseract, or do you first transform it in some additional way in the (not shown) C++ code? Commented Apr 8 at 9:57
  • 5 instead of comma is a scale issue, as they are similar shapes. Commented Apr 8 at 10:17
  • OCR is never 100% accurate, so you're not in an unexpected situation, especially with a free OCR tool. Nonetheless it should work better on such nice data (high-resolution non-blurred glyphs, no mix of fonts or similar glyphs in a font) as you show, so there's definitely potential for improvement. @JoopEggen has a good suggestion; scaling is a common issue. For example, in the Tesseract GUI I have installed, cropping to a box including the black border gives "Tar3,Tar4", while cropping only to the yellow area gives "Tar3,Tar4d". Commented Apr 8 at 13:13
  • Also, differently sized uppercase is an issue - try to specify the font if you can, so the OCR would have better options for what characters can appear at what relative size. I found one link about it. Commented Apr 8 at 13:17
  • I'm terribly sorry, I intended to add the preprocessed images and forgot. I'll add them now. Commented Apr 8 at 16:15
