Using Tesseract OCR with Docling: PyTessBaseAPI call fails, won't init

Question

I am using Docling and trying to parse images containing scanned text with Tesseract OCR (any OCR engine would do, but Tesseract is preferred if possible).

My code is:

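from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# TesseractOcrOptions selects the tesserocr bindings (PyTessBaseAPI)
pipeline_options.ocr_options = TesseractOcrOptions()

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)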

The error is raised in PyTessBaseAPI._init_api, which is reached from this block in Docling's TesseractOcrModel.__init__ function:

self.reader = tesserocr.PyTessBaseAPI(
    lang=lang,
    psm=main_psm,
    **tesserocr_kwargs,
)

I have all of the language files listed in lang (fra, deu, eng, odu) in the tessdata directory, and I also copied them into the same folder as my .py file. What else do I need to do? I am on Python 3.12 and Tesseract 5.3.4.

Can you share the complete error traceback? Also, what OS are you on, and how did you install Tesseract (brew, apt, Windows installer)?
– Jared McCarthy, Commented Nov 23 at 5:02
Also, as additional info: does tesseract --version work in your terminal, and does it show the correct tessdata path?
– Jared McCarthy, Commented Nov 23 at 5:02
@JaredMcCarthy, tesseract works on the command line. I ended up getting it to work by adding "/tessdata" to the end of the TESSDATA_PREFIX variable. Spyder was not picking up the environment variable, so I set it in the console, and after trying several variations I lost track of what is actually necessary. Perhaps copying the traineddata files into the folder with my .py files, along with putting the tessdata path in the variable, is what ended up working.
– Paul Gibson, Commented Nov 23 at 21:58
Glad it worked. With Tesseract 4 and later, TESSDATA_PREFIX should point to the tessdata folder itself (Tesseract 3.x expected its parent), which matches what fixed it for you. For IDEs like Spyder, you might need to set it explicitly with os.environ['TESSDATA_PREFIX'] = '/path/to/tessdata', since shell environment variables don't always propagate.
– Jared McCarthy, Commented Nov 24 at 4:47
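
The environment-variable advice from the comments can be sketched roughly as follows. This is a minimal illustration and not code from the thread: configure_tessdata is a hypothetical helper, and the path in the usage comment is a placeholder. It assumes TESSDATA_PREFIX should name the directory that holds the .traineddata files (which is what worked for the asker), checks that each requested language file is present, and only then sets the variable for the current process. It has to run before tesserocr or Docling initializes Tesseract.

```python
import os
from pathlib import Path


def configure_tessdata(tessdata_dir, langs):
    """Set TESSDATA_PREFIX for this process after checking the language files.

    Hypothetical helper for illustration; returns the languages whose
    <lang>.traineddata file was not found in tessdata_dir.
    """
    directory = Path(tessdata_dir)
    # Tesseract expects one <lang>.traineddata file per requested language.
    missing = [
        lang for lang in langs
        if not (directory / f"{lang}.traineddata").is_file()
    ]
    if not missing:
        # Only export the variable when every language file was found.
        os.environ["TESSDATA_PREFIX"] = str(directory)
    return missing


# Usage (placeholder path): call this before building the DocumentConverter.
# missing = configure_tessdata("/path/to/tessdata", ["fra", "deu", "eng"])
# if missing:
#     print("Missing traineddata for:", missing)
```

Setting the variable inside the script sidesteps the question of whether the IDE inherited the shell environment, and the returned list makes a missing language file visible before the OCR engine fails to initialize.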

user898678 · Accepted Answer · 2025-11-23 15:22:33Z

Make sure the tesseract executable is on your PATH.
Make sure TESSDATA_PREFIX is set to the directory that contains the Tesseract language data (.traineddata) files.
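
Both checks can be done from Python before involving Docling at all. The sketch below is an illustration under stated assumptions, not part of the original answer: check_tesseract_cli is a hypothetical helper, and it assumes that tesseract --list-langs prints one header line followed by one language code per line, which is the usual output format.

```python
import shutil
import subprocess


def check_tesseract_cli(cmd="tesseract"):
    """Return (executable_path, languages) for the Tesseract CLI.

    Hypothetical helper for illustration. Returns (None, []) when the
    executable is not on PATH, and (path, []) when listing languages fails.
    """
    exe = shutil.which(cmd)
    if exe is None:
        return None, []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        # Typically means Tesseract could not open its tessdata directory.
        return exe, []
    # Assumed format: a header line, then one language code per line.
    output = result.stdout or result.stderr
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return exe, lines[1:]


# Usage:
# exe, langs = check_tesseract_cli()
# print("tesseract found at:", exe)
# print("languages it can load:", langs)
```

If the executable is found but the language list is empty or lacks the codes you pass in lang, the problem is the data path rather than Docling.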

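Then run your script, e.g. python docling_tesseract.py <document>. The version below uses TesseractCliOcrOptions, which calls the tesseract executable instead of going through the tesserocr bindings: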

import sys

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (PdfPipelineOptions,
                                                TesseractCliOcrOptions)
from docling.document_converter import DocumentConverter, PdfFormatOption


def process_document(file_path):
    # Use the tesseract command-line tool; "auto" lets Docling pick the language
    ocr_options = TesseractCliOcrOptions(lang=["auto"])
    pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )
    # Convert the file and print the recognized content as Markdown
    doc = converter.convert(file_path).document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python docling_tesseract.py <document>")
        sys.exit(1)
    process_document(sys.argv[1])
