iOS: Tesseract OCR, why is recognition so poor? Engine principle

Question

I have a question about how Tesseract OCR works. As far as I understand, after shape detection the symbols (their forms) are scaled (resized) to a specific font size. That font size is based on the trained data. Basically, the trained set defines the symbols (their geometry and shape), and maybe their representation.

I am using Tesseract 3.01 (the latest version) on the iOS platform. I checked the Tesseract FAQ and looked at the forum, but I do not understand why recognition quality is low for some images.

It is said that the font should be larger than 12pt and the image should be more than 300 DPI. I did all the necessary preprocessing, such as blurring (where needed) and contrast enhancement. I even tried the other engine in Tesseract OCR, called CUBE.
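To make the 12pt / 300 DPI guideline concrete, here is a small arithmetic sketch. It only uses the fact that 1 pt = 1/72 inch, so the nominal height of 12pt text at 300 DPI is 12 / 72 * 300 = 50 pixels. The function names are made up for illustration and are not part of Tesseract, and the measured text height is assumed to be something you obtain yourself (for example by inspecting the image).

```cpp
#include <cassert>
#include <cmath>

// Nominal height in pixels of text set at `pointSize` points when the page
// is rendered or scanned at `dpi` dots per inch (1 pt = 1/72 inch).
double textHeightInPixels(double pointSize, double dpi) {
    assert(pointSize > 0.0 && dpi > 0.0);
    return pointSize / 72.0 * dpi;
}

// Factor by which an image would have to be resized so that text measured at
// `measuredHeightPx` pixels ends up at the nominal height of
// `targetPointSize` text at `targetDpi`. The defaults mirror the guideline
// quoted above; they are not values read from Tesseract.
double scaleFactorForTarget(double measuredHeightPx,
                            double targetPointSize = 12.0,
                            double targetDpi = 300.0) {
    assert(measuredHeightPx > 0.0);
    return textHeightInPixels(targetPointSize, targetDpi) / measuredHeightPx;
}
```

For example, text that is about 25 pixels tall in the photo would need to be scaled up by roughly a factor of 2 to match the guideline. The overall image dimensions alone do not say whether that is the case.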

But for some images I get bad recognition results, in spite of the fact that they are large (MIN(width, height) > 1000; I rescale them for Tesseract):

https://www.dropbox.com/sh/3nogs7e3ixc2sik/0YsPh2Tr7w

However, on another set of images the results are better:

https://www.dropbox.com/sh/92chi5zq1zp3zje/PPJ7ortR_a

Those images are smaller, so I do not resize them (I just convert them to grayscale).
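For reference, a grayscale conversion of this kind can be sketched as follows. This is only an illustration, not the code actually used here: it assumes tightly packed 8-bit pixels in R, G, B, A byte order and uses the common Rec. 601 luma weights; `rgbaToGray` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts tightly packed 8-bit RGBA pixels (assumed byte order R, G, B, A)
// to one 8-bit gray value per pixel using integer Rec. 601 luma weights.
// The alpha channel is ignored.
std::vector<std::uint8_t> rgbaToGray(const std::vector<std::uint8_t>& rgba,
                                     int width, int height) {
    assert(width > 0 && height > 0);
    const std::size_t count = static_cast<std::size_t>(width) * height;
    assert(rgba.size() == count * 4);

    std::vector<std::uint8_t> gray(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned r = rgba[4 * i];
        const unsigned g = rgba[4 * i + 1];
        const unsigned b = rgba[4 * i + 2];
        // Weights sum to 1000; adding 500 rounds to the nearest integer.
        gray[i] = static_cast<std::uint8_t>((299 * r + 587 * g + 114 * b + 500) / 1000);
    }
    return gray;
}
```

A buffer like this holds one byte per pixel, so it would be handed to `SetImage` with `bytes_per_pixel` set to 1 and `bytes_per_line` equal to the width.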

If what I wrote about the engine is correct, suppose the trained set is based on a font of size 14pt. Symbols from the pictures are resized to that specific size, so I do not see any reason why they are not recognised in such a case.

I also tried custom dictionaries to penalise non-dictionary words, but that did not give much benefit to recognition.

tesseract = new tesseract::TessBaseAPI();

// Pass the suffix of the custom user-words file as an init-time variable.
GenericVector<STRING> variables_name(1), variables_value(1);
variables_name.push_back("user_words_suffix");
variables_value.push_back("user-words");

// Init() returns 0 on success; SetVariable() returns true on success.
int retVal = tesseract->Init([self.tesseractDataPath cStringUsingEncoding:NSUTF8StringEncoding], NULL, tesseract::OEM_TESSERACT_ONLY, NULL, 0, &variables_name, &variables_value, false);
bool ok = (retVal == 0);
// Combine with &= so that a single failed step marks the whole setup as failed.
ok &= tesseract->SetVariable("language_model_penalty_non_dict_word", "0.2");
ok &= tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "0.2");

if (!ok)
{
    NSLog(@"Error initializing tesseract!");
}

So my question is: should I train Tesseract on another font?

And, honestly speaking, why should I train it? With the default trained data I get good recognition on text from the Internet or from a PC (Mac) screen.

I also checked the original Tesseract English trained data. It has 38 TIFF files that belong to the following families: 1) arial 2) verdana 3) trebuc 4) times 5) georgia 6) cour

It seems that the font in the images does not belong to this set.

I think the image should be deskewed and dewarped: stackoverflow.com/questions/12275259/…
– Siarhei Yakushevich, Commented Nov 22, 2013 at 8:42

Mariusz Ignatowicz · Accepted Answer · 2014-09-27 21:30:46Z

1

In your case the size of the image is not the problem. As I can see from your attached images (and I'm surprised that nobody has mentioned it before), the problem is that the text in the images that give bad results does not lie on straight lines.

One of the things Tesseract does in the early stages of the OCR process is to detect the image layout and extract whole lines of text.

This image is the best example to illustrate this part of the process:

[Image: Tesseract lines extraction]

As you can see, the engine expects each line of text to be straight and perpendicular to the side edge of the image.
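If the text is only rotated, the rotation angle can be estimated before OCR with a projection-profile search, sketched below. This is a generic technique, not something taken from Tesseract's source, and `BinaryImage` and `estimateSkewDegrees` are hypothetical names. The sketch assumes an already binarized image (1 = ink) and handles only a single global rotation. Text on a curved or warped surface would additionally need dewarping, which this does not attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal binary image: 1 = ink, 0 = background, row-major.
struct BinaryImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;

    BinaryImage(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}
    std::uint8_t at(int x, int y) const {
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
    void set(int x, int y) {
        pixels[static_cast<std::size_t>(y) * width + x] = 1;
    }
};

// Tries every angle in [-maxDegrees, +maxDegrees] in steps of stepDegrees.
// For each angle, ink pixels are projected onto the axis perpendicular to the
// assumed text direction and counted per row bin. When the angle matches the
// real skew, each text line collapses into few bins, so the sum of squared
// bin counts peaks. A positive result means lines run downwards to the right
// (image y axis pointing down).
double estimateSkewDegrees(const BinaryImage& img, double maxDegrees, double stepDegrees) {
    assert(maxDegrees > 0.0 && maxDegrees <= 45.0 && stepDegrees > 0.0);
    const double kPi = 3.14159265358979323846;
    const int steps = static_cast<int>(std::floor(2.0 * maxDegrees / stepDegrees));

    // Projected rows lie in (-width, height + width); shift by width to index.
    std::vector<long> bins(static_cast<std::size_t>(img.height) + 2 * img.width);
    double bestAngle = 0.0;
    double bestScore = -1.0;

    for (int i = 0; i <= steps; ++i) {
        const double degrees = -maxDegrees + i * stepDegrees;
        const double theta = degrees * kPi / 180.0;
        const double c = std::cos(theta);
        const double s = std::sin(theta);
        std::fill(bins.begin(), bins.end(), 0L);

        for (int y = 0; y < img.height; ++y) {
            for (int x = 0; x < img.width; ++x) {
                if (img.at(x, y)) {
                    const long row = std::lround(y * c - x * s) + img.width;
                    ++bins[static_cast<std::size_t>(row)];
                }
            }
        }

        double score = 0.0;
        for (long count : bins) {
            score += static_cast<double>(count) * static_cast<double>(count);
        }
        if (score > bestScore) {
            bestScore = score;
            bestAngle = degrees;
        }
    }
    return bestAngle;
}
```

The image would then be rotated by the negative of the returned angle before being passed to the engine. The exhaustive search is slow on large photos, so in practice it is usually run on a downscaled copy.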


Bhumeshwer katre · Accepted Answer · 2013-11-21 08:09:32Z

0

If you are done with all the necessary image processing, then try this. It may be helpful for you:

    CGSize size = [image size];
    int width = size.width;
    int height = size.height;

    uint32_t *_pixels = (uint32_t *) malloc(width * height * sizeof(uint32_t));
    if (!_pixels) {
        return; // Allocation failed
    }

    // Clear the pixels so any transparency is preserved
    memset(_pixels, 0, width * height * sizeof(uint32_t));

    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();

    // Create a context with RGBA _pixels
    CGContextRef context = CGBitmapContextCreate(_pixels, width, height, 8, width * sizeof(uint32_t), colorSpace, kCGBitmapByteOrder32Little | kCGImageAlphaPremultipliedLast);

    // Paint the bitmap to our context, which will fill in the _pixels array
    CGContextDrawImage(context, CGRectMake(0, 0, width, height), [image CGImage]);

    // We're done with the context and color space
    CGContextRelease(context);
    CGColorSpaceRelease(colorSpace);

    // 4 bytes per pixel, width * 4 bytes per line
    _tesseract->SetImage((const unsigned char *) _pixels, width, height, sizeof(uint32_t), width * sizeof(uint32_t));

    // Restrict the output to the listed characters
    _tesseract->SetVariable("tessedit_char_whitelist", ".#0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/-!");
    _tesseract->SetVariable("tessedit_consistent_reps", "0");

    char *utf8Text = _tesseract->GetUTF8Text();
    NSString *str = nil;

    if (utf8Text) {
        str = [NSString stringWithUTF8String:utf8Text];
        delete[] utf8Text; // The caller owns the buffer returned by GetUTF8Text()
    }

    // Recognition is finished, so the pixel buffer can be released
    free(_pixels);


4 Comments

Siarhei Yakushevich Over a year ago

Thanks Katre, I tried it and it did not help. Now, instead of seeing different kinds of recognition trash, I see "trash" made of characters from the list (tessedit_char_whitelist).

Bhumeshwer katre Over a year ago

Through this code I get a 90% accurate result. So the problem may be only with your image processing. Try a sample image as you captured it, without processing it, and see how the result differs.

Siarhei Yakushevich Over a year ago

If it's not a big deal, can you try your OCR on one of the "bad" images mentioned above (dropbox.com/scl/fo/y93zklnuvwzhirksk1fz1/…)?

Bhumeshwer katre Over a year ago

For this type of image I have no clue, sorry about that.
