I'm using pytesseract to read tabular data out of an image but I'm having trouble with the software making "educated guesses" about characters and word splitting based on context.
I have a specific example I'd like to solve. If I whitelist the $ character then the word splitting gives me this for one line of text:
['Total', '$8,644.27', '$9,653.70']
But if I blacklist the $ character and make no other changes, I get this unwanted split near the first comma (and the comma itself is missing):
['Total', '8', '644.27', '9,653.70']
I could just strip the $ after tesseract runs, but I blacklisted $ deliberately: when it's allowed, tesseract will often turn sequences like S1 into $1, which is a related, equally annoying substitution.
It also sometimes gets a number wrong if a similar number appears nearby.
It seems tesseract is trying to be clever under the hood, making LLM-style guesses from context. The thing is, I have a very high-resolution source image, so I'd rather tesseract spend less effort on recognizing words/context and more on identifying each character purely by its outline.
The current options I have are:
VALID_CHARS = string.digits + string.ascii_letters + '$.,<>\\/#%()*@&: +-'
CUSTOM_TESSERACT_CONFIG = (
    '--oem 3 --psm 6 '
    f'-c tessedit_char_whitelist="{VALID_CHARS}" '
    '-c tessedit_enable_dict_correction=0 '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
    '-c load_punc_dawg=0 '
    '-c load_number_dawg=0 '
    '-c load_unambig_dawg=0 '
    '-c load_bigram_dawg=0 '
    '-c load_fixed_length_dawgs=0 '
    '-c wordrec_enable_assoc=0 '
    '-c language_model_penalty_non_freq_dict_word=0 '
    '-c language_model_penalty_non_dict_word=0 '
    '-c tessedit_prefer_joined_punct=1 '
    '-c textord_enable_word_ngrams=0 '
    '-c tessedit_good_quality_unrej=1 '
    '-c tessedit_enable_bigram_correction=0 '
    '-c tessedit_enable_doc_dict=0 '
    '-c textord_enable_out_of_punct=0 '
    '-c textord_enable_xheight_stats=0 '
    '-c enable_noise_removal=0 '
    '-c classify_enable_adaptive_matcher=0 '
    '-c classify_enable_learning=0 '
    '-c tessedit_preserve_blk_rej_perfect_wds=1 '
    '-c preserve_interword_spaces=1 '
    '-c segment_penalty_dict_case=0 '
    '-c segment_penalty_garbage=0 '
    '-c textord_split_num_pattern=0'
)
I'm not even sure whether these options are doing anything, or whether I need to retrain a model or something.
I need the character or word boundaries, not just the text, as I have to group words/chars based on their bounding boxes.
I'm really only interested in character recognition (Latin alphanumerics and punctuation) and splitting on whitespace only, as long as I get the x, y, w, h coordinates of each word. I don't want tesseract to change characters based on surrounding characters, punctuation, number formats, frequencies, dictionaries, or whatever else it's doing under the hood.
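For concreteness, here's roughly how I group the words afterwards. It works on the parallel-lists dict shape that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns; the sample data below is hand-written to mimic that shape, not real OCR output:

```python
from collections import defaultdict

# Hand-written sample in the shape of pytesseract.image_to_data(...,
# output_type=Output.DICT): parallel lists, one entry per detected word.
data = {
    'line_num': [1, 1, 1, 2],
    'text':     ['Total', '8,644.27', '9,653.70', 'Subtotal'],
    'left':     [10, 120, 260, 10],
    'top':      [50, 50, 50, 90],
    'width':    [60, 100, 100, 90],
    'height':   [20, 20, 20, 20],
    'conf':     [96, 91, 93, 95],
}

def group_by_line(data):
    """Group word boxes by tesseract's line number as (text, x, y, w, h)."""
    lines = defaultdict(list)
    for i, word in enumerate(data['text']):
        if not word.strip():
            continue  # image_to_data emits empty text for non-word levels
        box = (word, data['left'][i], data['top'][i],
               data['width'][i], data['height'][i])
        lines[data['line_num'][i]].append(box)
    return dict(lines)

grouped = group_by_line(data)
print([w for w, *_ in grouped[1]])  # -> ['Total', '8,644.27', '9,653.70']
```

So the grouping side is fine; the problem is purely that the words/boxes tesseract hands me are already split or corrected in ways I don't want.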