I have a question about how Tesseract OCR works. As far as I understand, after shape detection the symbols are scaled (resized) to some specific font size, and that size is determined by the trained data. Basically, the training set defines the symbols (their geometry and shape), and perhaps their representation.
I am using Tesseract 3.01 (the latest version) on iOS. I have checked the Tesseract FAQ and looked through the forum, but I do not understand why recognition quality is low for some images.
The usual advice is that the font should be bigger than 12pt and the image should have more than 300 DPI. I did all the necessary preprocessing, such as blurring (where needed) and contrast enhancement. I even tried the other engine in Tesseract OCR, which is called CUBE.
But for some images I get bad recognition results, even though they are large (MIN(width, height) > 1000) and I rescale them for Tesseract:
https://www.dropbox.com/sh/3nogs7e3ixc2sik/0YsPh2Tr7w
However, on another set of images the results are better:
https://www.dropbox.com/sh/92chi5zq1zp3zje/PPJ7ortR_a
Those images are smaller and I do not resize them (I just convert them to grayscale).
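For reference, the grayscale step is roughly equivalent to the sketch below. This is only an illustration under assumptions: it takes interleaved 8-bit RGBA pixels and uses integer approximations of the common Rec. 601 luma weights (0.299, 0.587, 0.114). `RgbaToGray` is a hypothetical helper name, not a Tesseract or iOS API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Tesseract): converts interleaved
// 8-bit RGBA pixels to one 8-bit gray value per pixel.
// Assumption: Rec. 601 luma weights, approximated as 77/256, 150/256, 29/256.
std::vector<std::uint8_t> RgbaToGray(const std::vector<std::uint8_t>& rgba) {
  std::vector<std::uint8_t> gray;
  gray.reserve(rgba.size() / 4);
  for (std::size_t i = 0; i + 3 < rgba.size(); i += 4) {
    // Weighted sum of R, G, B; alpha (rgba[i + 3]) is ignored.
    // Adding 128 before the shift rounds to the nearest integer.
    const unsigned y =
        (77u * rgba[i] + 150u * rgba[i + 1] + 29u * rgba[i + 2] + 128u) >> 8;
    gray.push_back(static_cast<std::uint8_t>(y));
  }
  return gray;
}
```

The weights sum to 256, so a white pixel stays at 255 and a black pixel stays at 0.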
If what I wrote about the engine is correct, suppose the training set is based on a 14pt font. Symbols from the pictures are resized to some specific size, so I do not see any reason why they would not be recognised in that case.
I also tried custom dictionaries to penalise non-dictionary words, but that did not improve recognition much. Here is my initialisation code:
    tesseract = new tesseract::TessBaseAPI();

    // Init-time variable: suffix of the user word list file.
    GenericVector<STRING> variables_name(1), variables_value(1);
    variables_name.push_back("user_words_suffix");
    variables_value.push_back("user-words");

    int retVal = tesseract->Init([self.tesseractDataPath cStringUsingEncoding:NSUTF8StringEncoding],
                                 NULL, tesseract::OEM_TESSERACT_ONLY, NULL, 0,
                                 &variables_name, &variables_value, false);

    // Init() returns 0 on success; SetVariable() returns true on success.
    // Use &= so that any single failure is reported.
    bool ok = (retVal == 0);
    ok &= tesseract->SetVariable("language_model_penalty_non_dict_word", "0.2");
    ok &= tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "0.2");
    if (!ok)
    {
        NSLog(@"Error initializing tesseract!");
    }
So my question is: should I train Tesseract on another font?
And, honestly speaking, why should I have to train it? With the default trained data I get good recognition on text from the Internet or from a PC (Mac) screen.
I also checked the original Tesseract English trained data. It has 38 TIFF files that belong to the following families: 1) arial 2) verdana 3) trebuc 4) times 5) georgia 6) cour
It seems that the font in my images does not belong to this set.
