I am using docling and trying to get images with scanned text to parse with Tesseract OCR (could be any OCR, but tesseract is preferred if possible).
My code is:
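from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR and table structure recognition for PDF input
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# Use the tesserocr-backed engine
pipeline_options.ocr_options = TesseractOcrOptions()

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)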
The error happens in PyTessBaseAPI._init_api when called with this block in the tesseractocrmodel.__init__ function:
self.reader = tesserocr.PyTessBaseAPI(
    lang=lang,
    psm=main_psm,
    **tesserocr_kwargs,
)
I have all of the language files that are in lang (fra, deu, eng, odu) in both the tessdata dir and copied them into the same place as my py file. What else do I need to do? Python 3.12, tesseract 5.3.4.
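Does tesseract --version work in your terminal, and does it show the correct tessdata path? (On Tesseract 5, tesseract --list-langs prints the tessdata directory it is actually using.) Check where TESSDATA_PREFIX points: Tesseract 3.x expected the parent of the tessdata folder, but 4.x and later, including your 5.3.4, expect the tessdata folder itself. For IDEs like Spyder, you might need to set it explicitly with os.environ['TESSDATA_PREFIX'] = '/path/to/tessdata' before the converter is created, since shell env vars don't always propagate.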
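To rule out a path problem from inside Python, something like the following could help. It is only a sketch using the standard library: the helper names missing_traineddata and point_tesseract_at are made up for this example, and it assumes a Tesseract 4+ setup where TESSDATA_PREFIX is read as the folder that directly contains the .traineddata files. If your setup expects the parent folder instead, set the variable to that folder's parent.

```python
import os
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the languages in langs that have no <lang>.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    return [lang for lang in langs if not (folder / f"{lang}.traineddata").is_file()]


def point_tesseract_at(tessdata_dir, langs):
    """Check the language files exist, then set TESSDATA_PREFIX for this process.

    Hypothetical helper: it only touches os.environ and does not call docling or tesserocr.
    """
    missing = missing_traineddata(tessdata_dir, langs)
    if missing:
        raise FileNotFoundError(
            f"No .traineddata file in {tessdata_dir} for: {', '.join(missing)}"
        )
    # Assumption: Tesseract 4+ treats TESSDATA_PREFIX as the tessdata folder itself.
    os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
    return os.environ["TESSDATA_PREFIX"]
```

Calling point_tesseract_at('/path/to/tessdata', ['fra', 'deu', 'eng']) at the top of the script, before docling is imported, fails with a message naming any language file that cannot be found, which is usually easier to act on than the failure inside PyTessBaseAPI.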