
I am using docling and trying to parse images of scanned text with Tesseract OCR (any OCR engine would do, but Tesseract is preferred if possible).

My code is:

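from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()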
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = TesseractOcrOptions()

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

The error is raised in PyTessBaseAPI._init_api, which is reached from this block in TesseractOcrModel.__init__:

self.reader = tesserocr.PyTessBaseAPI(
    lang=lang,
    psm=main_psm,
    **tesserocr_kwargs,
)

All of the language files listed in lang (fra, deu, eng, odu) are in the tessdata directory, and I also copied them into the folder that contains my .py file. What else do I need to do? I am on Python 3.12 and Tesseract 5.3.4.

  • Can you share the complete error traceback? Also, what OS are you on, and how did you install Tesseract (brew, apt, Windows installer)? Commented Nov 23 at 5:02
  • Also, does tesseract --version work in your terminal, and does it show the correct tessdata path? Commented Nov 23 at 5:02
  • @JaredMcCarthy, tesseract works on the command line. I ended up getting it to work by adding "/tessdata" to the end of the TESSDATA_PREFIX variable. Spyder was not picking up the environment variable, so I set it in the console, and after trying several variations I lost track of what is actually necessary. Copying the traineddata files into the folder with my .py files, together with pointing the path into tessdata, may be what ended up working. Commented Nov 23 at 21:58
  • Glad it worked. On Tesseract 4 and later, TESSDATA_PREFIX should point to the tessdata folder itself (3.x releases expected its parent), which matches what you found. For IDEs like Spyder, you might need to set it explicitly with os.environ['TESSDATA_PREFIX']='/path/to/tessdata', since shell env vars don't always propagate. Commented Nov 24 at 4:47
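
A minimal sketch of setting the variable from inside the script, for IDEs such as Spyder that do not inherit the shell environment. The helper name `set_tessdata_prefix` and the directory below are hypothetical placeholders. In this thread, on Tesseract 5.3.4, the value that worked ended in /tessdata.

```python
import os
from pathlib import Path


def set_tessdata_prefix(tessdata_dir):
    """Set TESSDATA_PREFIX for the current process and return the value that was set."""
    value = str(Path(tessdata_dir))
    os.environ["TESSDATA_PREFIX"] = value
    return value


# Hypothetical location; replace it with the directory that holds the
# *.traineddata files on your machine.
prefix = set_tessdata_prefix("/usr/share/tesseract-ocr/5/tessdata")

# To be safe, import docling (and, through it, tesserocr) only after this
# point, so the OCR backend sees the variable when it initialises:
# from docling.document_converter import DocumentConverter
```

The change only affects the running Python process and anything it launches, so it has to be repeated in every script or IDE session that needs it.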

1 Answer

  1. Make sure the tesseract executable is on the PATH.

  2. Make sure TESSDATA_PREFIX is set to the directory that contains the Tesseract language data files (*.traineddata).

  3. Run your script, e.g. python docling_tesseract.py <document>:

    import sys
    
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (PdfPipelineOptions,
                                                    TesseractCliOcrOptions)
    from docling.document_converter import DocumentConverter, PdfFormatOption
    
    
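    def process_document(file_path):
        """Convert a document with OCR enabled and print it as Markdown."""
        # TesseractCliOcrOptions runs the tesseract executable rather than the
        # tesserocr bindings; lang=["auto"] asks for language detection instead
        # of a fixed language list.
        ocr_options = TesseractCliOcrOptions(lang=["auto"])
        pipeline_options = PdfPipelineOptions(do_ocr=True, ocr_options=ocr_options)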
        converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                )
            }
        )
        doc = converter.convert(file_path).document
        md = doc.export_to_markdown()
        print(md)
    
    
    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: python docling_tesseract.py <document>")
            sys.exit(1)
        process_document(sys.argv[1])
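
Steps 1 and 2 can also be checked from Python before running the converter. The following is only a sketch using the standard library: the helper names `missing_traineddata` and `preflight` are made up for illustration, and it assumes that TESSDATA_PREFIX names the directory holding the *.traineddata files directly.

```python
import os
import shutil
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes with no <lang>.traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [
        lang for lang in langs
        if not (directory / f"{lang}.traineddata").is_file()
    ]


def preflight(langs):
    """Return a list of setup problems; an empty list means none were found."""
    problems = []
    # Step 1: the tesseract executable must be discoverable on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # Step 2: TESSDATA_PREFIX must be set and contain the language files.
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    else:
        missing = missing_traineddata(prefix, langs)
        if missing:
            problems.append(
                f"no .traineddata file in {prefix} for: {', '.join(missing)}"
            )
    return problems
```

Calling `preflight(["eng", "fra", "deu"])` inside the IDE that runs the script shows whether that session actually sees the same PATH and TESSDATA_PREFIX as the terminal.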
    
    