
I'm using pytesseract to read tabular data out of an image but I'm having trouble with the software making "educated guesses" about characters and word splitting based on context.

I have a specific example I'd like to solve. If I whitelist the $ character then the word splitting gives me this for one line of text:

['Total', '$8,644.27', '$9,653.70']

But if I blacklist the $ character and make no other changes, I get this unwanted split at the first comma (and the comma itself goes missing):

['Total', '8', '644.27', '9,653.70']

I could just strip the $ after tesseract runs, but I blacklisted $ deliberately: with $ allowed, tesseract will often turn sequences like S1 into $1, which is a related, equally annoying substitution.
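For reference, the post-run cleanup described above is trivial on its own; a minimal sketch (the function name is mine, and this obviously doesn't address the S1 → $1 problem):

```python
def strip_currency(tokens):
    """Remove a leading '$' from each OCR token, dropping any token
    that consisted of '$' alone."""
    cleaned = [t.lstrip('$') for t in tokens]
    return [t for t in cleaned if t]

print(strip_currency(['Total', '$8,644.27', '$9,653.70']))
# → ['Total', '8,644.27', '9,653.70']
```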

It also sometimes misreads a number when a similar number appears nearby.

It seems tesseract is trying to be clever under the hood, making LLM-style context guesses. But my source image is very high resolution, so I'd rather tesseract spend less effort on recognizing words/context and more on identifying characters purely by their outlines.

The current options I have are:

VALID_CHARS = string.digits + string.ascii_letters + '$.,<>\\/#%()*@&: +-'
CUSTOM_TESSERACT_CONFIG = (
    '--oem 3 --psm 6 '
    f'-c tessedit_char_whitelist="{VALID_CHARS}" '
    '-c tessedit_enable_dict_correction=0 '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
    '-c load_punc_dawg=0 '
    '-c load_number_dawg=0 '
    '-c load_unambig_dawg=0 '
    '-c load_bigram_dawg=0 '
    '-c load_fixed_length_dawgs=0 '
    '-c wordrec_enable_assoc=0 '
    '-c language_model_penalty_non_freq_dict_word=0 '
    '-c language_model_penalty_non_dict_word=0 '
    '-c tessedit_prefer_joined_punct=1 '
    '-c textord_enable_word_ngrams=0 '
    '-c tessedit_good_quality_unrej=1 '
    '-c tessedit_enable_bigram_correction=0 '
    '-c tessedit_enable_doc_dict=0 '
    '-c textord_enable_out_of_punct=0 '
    '-c textord_enable_xheight_stats=0 '
    '-c enable_noise_removal=0 '
    '-c classify_enable_adaptive_matcher=0 '
    '-c classify_enable_learning=0 '
    '-c tessedit_preserve_blk_rej_perfect_wds=1 '
    '-c preserve_interword_spaces=1 '
    '-c segment_penalty_dict_case=0 '
    '-c segment_penalty_garbage=0 '
    '-c textord_split_num_pattern=0'
)

I'm not even sure if these options are doing anything or if I need to retrain a model or something.

I need the character or word boundaries, not just the text, as I have to group words/chars based on their bounding boxes.

I'm really only interested in character recognition (Latin alphanumerics and punctuation) and splitting on whitespace, as long as I get the x, y, w, h coordinates of each word. I don't want tesseract to change characters based on surrounding characters, punctuation, number formats, frequencies, dictionaries, or whatever else it does under the hood.
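For the bounding-box part, `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT` returns parallel lists keyed `'text'`, `'left'`, `'top'`, `'width'`, `'height'` (among others). A small sketch of extracting per-word boxes from that dict (the helper name is mine):

```python
def word_boxes(data):
    """Extract (text, x, y, w, h) tuples for each non-empty word from a
    pytesseract.image_to_data(..., output_type=Output.DICT) result."""
    boxes = []
    for i, text in enumerate(data['text']):
        word = text.strip()
        if not word:
            continue  # tesseract emits empty rows for page/block/line structure
        boxes.append((word, data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i]))
    return boxes

# Typical call, assuming the config string above:
# data = pytesseract.image_to_data(img, config=CUSTOM_TESSERACT_CONFIG,
#                                  output_type=pytesseract.Output.DICT)
# print(word_boxes(data))
```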

1 Answer

Use the legacy Tesseract engine (--oem 0) for character-level OCR. (Note that --oem 1 selects the LSTM engine only; 0 is the legacy engine.) The legacy, pre-LSTM engine is often better for pure character recognition and makes fewer "smart" context-based guesses.
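A pared-down config along those lines, as a sketch (the variable name and character set are illustrative, not from the question). One caveat: --oem 0 only works with a traineddata file that contains the legacy model — the files in the main tessdata repo do, but tessdata_fast and tessdata_best are LSTM-only:

```python
# Legacy-engine config sketch: digits-and-punctuation whitelist, dictionaries off.
LEGACY_CONFIG = (
    '--oem 0 --psm 6 '
    '-c tessedit_char_whitelist="0123456789.," '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
)
```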


Comments

It solved the above issue but created new ones: legacy mode seems to have trouble detecting decimal points, but only sometimes. It feels like it's still making context-based guesses, just fewer of them. I'll accept the answer, though, because I don't think a real solution exists without a lot of trial and error.
