I'm using pytesseract to read tabular data out of an image but I'm having trouble with the software making "educated guesses" about characters and word splitting based on context.
I have a specific example I'd like to solve. If I whitelist the $ character then the word splitting gives me this for one line of text:
['Total', '$8,644.27', '$9,653.70']
But if I blacklist the $ character and make no other changes, I get this unwanted split near the first comma (and the comma itself is missing):
['Total', '8', '644.27', '9,653.70']
I could just strip the $ after tesseract runs, but I blacklisted $ deliberately: when it's allowed, tesseract will often turn sequences like S1 into $1, which is a related, equally annoying substitution.
It also sometimes gets a number wrong if a similar number appears nearby.
It seems tesseract is trying to be clever under the hood, making LLM-style guesses from context. The thing is, I have a very high-resolution source image, so I'd rather tesseract spend less effort on recognizing words/context and more on identifying each character purely by its outline.
The current options I have are:
VALID_CHARS = string.digits + string.ascii_letters + '$.,<>\\/#%()*@&: +-'
CUSTOM_TESSERACT_CONFIG = (
    '--oem 3 --psm 6 '
    f'-c tessedit_char_whitelist="{VALID_CHARS}" '
    '-c tessedit_enable_dict_correction=0 '
    '-c load_system_dawg=0 '
    '-c load_freq_dawg=0 '
    '-c load_punc_dawg=0 '
    '-c load_number_dawg=0 '
    '-c load_unambig_dawg=0 '
    '-c load_bigram_dawg=0 '
    '-c load_fixed_length_dawgs=0 '
    '-c wordrec_enable_assoc=0 '
    '-c language_model_penalty_non_freq_dict_word=0 '
    '-c language_model_penalty_non_dict_word=0 '
    '-c tessedit_prefer_joined_punct=1 '
    '-c textord_enable_word_ngrams=0 '
    '-c tessedit_good_quality_unrej=1 '
    '-c tessedit_enable_bigram_correction=0 '
    '-c tessedit_enable_doc_dict=0 '
    '-c textord_enable_out_of_punct=0 '
    '-c textord_enable_xheight_stats=0 '
    '-c enable_noise_removal=0 '
    '-c classify_enable_adaptive_matcher=0 '
    '-c classify_enable_learning=0 '
    '-c tessedit_preserve_blk_rej_perfect_wds=1 '
    '-c preserve_interword_spaces=1 '
    '-c segment_penalty_dict_case=0 '
    '-c segment_penalty_garbage=0 '
    '-c textord_split_num_pattern=0'
)
I'm not even sure whether these options are doing anything, or whether I need to retrain a model or something.
I need the character or word boundaries, not just the text, as I have to group words/chars based on their bounding boxes.
I'm really only interested in character recognition (Latin alphanumerics and punctuation) and splitting on whitespace only, as long as I get the x, y, w, h coordinates of each word. I don't want tesseract to change characters based on surrounding characters, punctuation, number formats, frequencies, dictionaries, or whatever else it's doing under the hood.
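For concreteness, here's roughly how I group the words afterwards. It works on the parallel-lists dict shape that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns; the sample data below is hand-written to mimic that shape, not real OCR output:

```python
from collections import defaultdict

# Hand-written sample in the shape of pytesseract.image_to_data(...,
# output_type=Output.DICT): parallel lists, one entry per detected word.
data = {
    'line_num': [1, 1, 1, 2],
    'text':     ['Total', '8,644.27', '9,653.70', 'Subtotal'],
    'left':     [10, 120, 260, 10],
    'top':      [50, 50, 50, 90],
    'width':    [60, 100, 100, 90],
    'height':   [20, 20, 20, 20],
    'conf':     [96, 91, 93, 95],
}

def group_by_line(data):
    """Group word boxes by tesseract's line number as (text, x, y, w, h)."""
    lines = defaultdict(list)
    for i, word in enumerate(data['text']):
        if not word.strip():
            continue  # image_to_data emits empty text for non-word levels
        box = (word, data['left'][i], data['top'][i],
               data['width'][i], data['height'][i])
        lines[data['line_num'][i]].append(box)
    return dict(lines)

grouped = group_by_line(data)
print([w for w, *_ in grouped[1]])  # -> ['Total', '8,644.27', '9,653.70']
```

So the grouping side is fine; the problem is purely that the words/boxes tesseract hands me are already split or corrected in ways I don't want.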