
I’m working on a project where I have to detect objects in a PDF document. After detecting the objects, I need to read the text at each object's location, since that text will be used as the object's name.

Example: [image of a detected object and the text to be read]

I’ve managed to detect the objects. I use OpenCV to preprocess the image and want to use Tesseract to read the text from it.

I’ve used a high-resolution image in order to improve Tesseract's accuracy.
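For reference, Tesseract is commonly reported to work best when capital letters end up roughly 30–40 px tall, so rather than one fixed resolution it can help to compute an upscale factor per crop. A minimal sketch (the helper name and the 30–40 px target are my assumptions, not from my code; the measured height would come from your own contour detection):

```cpp
#include <algorithm>
#include <cmath>

// Sketch: compute how much to upscale a crop so its glyphs land near the
// capital-letter height Tesseract tends to handle best (~30-40 px).
// `measuredCapHeightPx` is assumed to be the bounding-box height of a
// detected character contour from the preprocessing step.
double scaleForTesseract(double measuredCapHeightPx, double targetPx = 35.0) {
    if (measuredCapHeightPx <= 0.0) return 1.0;   // nothing measurable
    double s = targetPx / measuredCapHeightPx;
    return std::clamp(s, 1.0, 8.0);               // avoid absurd blow-ups
}
```

The result would then be fed to something like `cv::resize` with `cv::INTER_CUBIC` before binarization.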

I’ve tried using a whitelist, a word list and a pattern file to further improve Tesseract's accuracy. I’ve also been experimenting with different page segmentation modes, such as PSM_SINGLE_WORD and PSM_SINGLE_BLOCK.
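For completeness, this is the shape of the two auxiliary files I am using. The word list is plain UTF-8 with one word per line; for the patterns file I am relying on Tesseract's trie pattern syntax, where (if I understand the docs correctly — please verify against your Tesseract version) `\d` matches any digit and `\*` means "zero or more of the previous element":

```text
# wordList.txt - one word per line
T2
T3
TAR3

# patterns.txt - one pattern per line (syntax is an assumption, see above)
T\d\d\*
TA\d\d\*
TAR\d\d\*
```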

Sometimes Tesseract reads the text correctly, e.g. the first image returns "T2,T3\n" using PSM_SINGLE_WORD (but not using PSM_SINGLE_BLOCK, which returns "12,13\n"). In most cases, however, it doesn’t return the correct text.

Preprocessed images for reference: [seven preprocessed crops, corresponding to the 1st–7th results below]

1st:
Word:  "T2,T3\n"
Block:  "12,13\n"
Expected: "T2,T3\n"

2nd:
Word:  "T2,T3,TAR3\n"
Block:  "12,13, 1T AR3\n"
Expected: "T2,T3,TAR3\n"

3rd:
Word:  "TA8\n"
Block:  "TAR8\n"
Expected: "TAR8\n"

4th:
Word:  "T2TT\n"
Block:  "12,13,14,\nTAR35TAR4\n"
Expected: "T2,T3,T4,\nTAR3,TAR4"

5th:
Word:  "TTT2AA,RRT333A,,\n"
Block:  "12,13,\nTAR35\nTA34\n"
Expected: "T2,T3,\nTAR3,\nTAR34\n"

6th:
Word:  "T15\n"
Block:  "TAR15\n"
Expected: "TAR15\n"

7th:
Word:  "T\n"
Block:  "111\n"
Expected: "T11\n"

As you can see, sometimes PSM_SINGLE_WORD returns better results, sometimes PSM_SINGLE_BLOCK does and sometimes neither returns the correct result.

Since I have quite a few different variations in the images, and I don’t understand why some characters are detected incorrectly (e.g. "," as "5" in the 4th image), I’m looking for assistance in resolving this problem.
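Since neither mode wins consistently, one workaround I am considering is to validate both outputs against the label grammar the images seem to follow (comma-separated tokens like T2, TA8, TAR35) and prefer whichever pass matches it. This is only a sketch under that assumed grammar, not part of my current code:

```cpp
#include <regex>
#include <string>

// Sketch (assumed label grammar): every newline-separated line should be a
// comma-separated list of tokens such as "T2", "TA8" or "TAR35".
bool looksLikeValidLabel(const std::string& text) {
    static const std::regex line(
        R"(T(A|AR)?\d+(\s*,\s*T(A|AR)?\d+)*\s*,?)");
    size_t start = 0;
    bool sawToken = false;
    while (start <= text.size()) {
        size_t end = text.find('\n', start);
        if (end == std::string::npos) end = text.size();
        std::string part = text.substr(start, end - start);
        if (!part.empty()) {
            if (!std::regex_match(part, line)) return false;
            sawToken = true;
        }
        if (end == text.size()) break;
        start = end + 1;
    }
    return sawToken;
}

// Prefer whichever OCR pass produced grammatically valid text.
std::string pickResult(const std::string& word, const std::string& block) {
    bool w = looksLikeValidLabel(word), b = looksLikeValidLabel(block);
    if (w && !b) return word;
    if (b && !w) return block;
    return word.empty() ? block : word;  // tie: fall back to the word pass
}
```

On the 1st image this would pick the PSM_SINGLE_WORD result ("T2,T3\n") over "12,13\n", since the latter doesn't start with T; it can't help when both passes are wrong, as in the 7th image.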

The relevant code snippet is the following:

// Copy the preprocessed OpenCV image into a Leptonica Pix, pixel by pixel.
// (Slow for large images; copying whole rows would be faster, but it works.)
Pix* pixImage = pixCreate(eroded.cols, eroded.rows, 8);
for (int y = 0; y < eroded.rows; y++) {
    for (int x = 0; x < eroded.cols; x++) {
        pixSetPixel(pixImage, x, y, eroded.at<uchar>(y, x));
    }
}

QString dataDir = qApp->applicationDirPath() + QStringLiteral("/tessdata");
QString d = QDir::toNativeSeparators(dataDir);

QString result;

// One recognition pass with the given page segmentation mode. A fresh
// TessBaseAPI per pass mirrors the original Init()/End() cycle.
auto runPass = [&](tesseract::PageSegMode psm, const char* label) -> QString {
    tesseract::TessBaseAPI tess;
    if (tess.Init(d.toLatin1().constData(), "eng", tesseract::OEM_DEFAULT) != 0)
        return QString();
    tess.SetPageSegMode(psm);
    // Note: the character whitelist is only honored by the legacy engine;
    // the LSTM engine (what OEM_DEFAULT selects in Tesseract 4+) ignores it.
    tess.SetVariable("tessedit_char_whitelist", "TAR0123456789, ");
    // Note: in Tesseract 4+ the dictionary-related parameters below are
    // init-only, so setting them after Init() may have no effect; they would
    // need to be passed via a config file or an extended Init() overload.
    tess.SetVariable("user_words_file", "wordList.txt");
    tess.SetVariable("user_patterns_file", "patterns.txt");
    tess.SetVariable("load_system_dawg", "0");
    tess.SetVariable("load_freq_dawg", "0");
    tess.SetVariable("wordrec_enable_assoc", "0");
    tess.SetVariable("use_only_my_words", "1");
    tess.SetImage(pixImage);

    char* text = tess.GetUTF8Text();
    QString passResult = QString::fromUtf8(text);
    delete[] text;  // GetUTF8Text() transfers ownership; this used to leak
    qDebug() << label << passResult;
    tess.End();
    return passResult;
};

result += runPass(tesseract::PSM_SINGLE_WORD, "Word:");   // First pass
result += runPass(tesseract::PSM_SINGLE_BLOCK, "Block:"); // Second pass

pixDestroy(&pixImage);

Since this is only my second question asked on Stack Overflow I might be missing some information, so please feel free to ask for anything you might require to help me.

  • One thing that would help is if you managed to resolve the dependency on C++, or at least Qt, and provided a minimal reproducible example. I guess you could reproduce this with just the tesseract command-line application, too. Also, concerning the image above: is that what you feed to Tesseract, or do you first transform it in some additional way in the (not shown) C++ code? Commented Apr 8 at 9:57
  • 5 instead of comma is a scale issue, as they are similar shapes. Commented Apr 8 at 10:17
  • OCR is never 100% accurate, so you're not in an unexpected situation, especially with a free OCR tool. Nonetheless it should work better on such nice data (high-resolution non-blurred glyphs, no mix of fonts or similar glyphs in a font) as you show, so there's definitely potential for improvement. @JoopEggen has a good suggestion; scaling is a common issue. For example, in the Tesseract GUI I have installed, cropping to a box including the black border gives "Tar3,Tar4", while cropping only to the yellow area gives "Tar3,Tar4d". Commented Apr 8 at 13:13
  • Also, differently sized uppercase is an issue - try to specify the font if you can, so the OCR would have better options for what characters can appear at what relative size. I found one link about it. Commented Apr 8 at 13:17
  • I'm terribly sorry, I intended to add the preprocessed images and forgot. I'll add them now. Commented Apr 8 at 16:15
