So I am putting together a simple Python script to OCR a PDF:

from PIL import Image
from tika import parser
import argparse
import img2pdf
import ocrmypdf

def main():
    
    parser = argparse.ArgumentParser(description="Get text from image.")
    parser.add_argument("image_path", metavar="i", help="The path to the image being used.")
    args = parser.parse_args()
    image_path = args.image_path
    
    pdf_from_image_file_name = convert_to_pdf(image_path)
    pdf_w_ocr_file_name = ocr_pdf()
    raw_text_from_ocr_pdf = get_text_from_pdf()
    print(raw_text_from_ocr_pdf)
    
def convert_to_pdf(image_path, new_pdf_file_name="pdf_from_image"):
    temp_image = Image.open(image_path)
    pdf_bytes = img2pdf.convert(temp_image.filename)
    new_file = open('./' + new_pdf_file_name + '.pdf', 'wb')
    new_file.write(pdf_bytes)
    temp_image.close()
    new_file.close()
    return new_pdf_file_name

def ocr_pdf(pdf_file_path="./temp_pdf_file_name.pdf", new_pdf_file_name="pdf_w_ocr.pdf"):
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
    return new_pdf_file_name

def get_text_from_pdf(pdf_file_path="./pdf_w_ocr.pdf"):
    raw_pdf = parser.from_file(pdf_file_path)
    return raw_pdf['content']
    
if __name__ == '__main__':
    main()

When the script reaches import ocrmypdf, it prints [WinError 2] The system cannot find the file specified but continues past it. The conversion from JPG or PNG to PDF works and the output is fine. However, on reaching ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True), I get ValueError: invalid version number '4.0.0.20181030'.

The full stack trace is:

[WinError 2] The system cannot find the file specified
Traceback (most recent call last):
  File "workshop_v1.py", line 71, in <module>
    main()
  File "workshop_v1.py", line 49, in main
    pdf_w_ocr_file_name = ocr_pdf()
  File "workshop_v1.py", line 63, in ocr_pdf
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\api.py", line 339, in ocr
    check_options(options, plugin_manager)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 271, in check_options
    _check_options(options, plugin_manager, ocr_engine_languages)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 266, in _check_options
    plugin_manager.hook.check_options(options=options)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 208, in _multicall
    return outcome.get_result()
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\builtin_plugins\tesseract_ocr.py", line 84, in check_options
    version_parser=tesseract.TesseractVersion,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\subprocess\__init__.py", line 313, in check_external_program
    if found_version and version_parser(found_version) < version_parser(need_version):
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 40, in __init__
    self.parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_exec\tesseract.py", line 72, in parse
    super().parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 137, in parse
    raise ValueError("invalid version number '%s'" % vstring)
ValueError: invalid version number '4.0.0.20181030'

I'm running this on an x64 PC with Windows 10, in a Python 3.7.10 environment via Anaconda. Relevant package versions in Python (via pip freeze):

  • pytesseract v0.3.7
  • ocrmypdf 12.1.0
  • ghostscript v0.7

Other potentially important version information outside Python:

  • tesseract-ocr v4.0.0.20181030 (I've added and tried a number of environment variables for this, detailed below)
  • leptonica v1.76.0
  • ghostscript v9.54.0
  • qpdf 10.3.2 (this was downloaded and then the files were placed in the C:/Windows/System32 directory)
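
One way to double-check the version strings listed above is to ask each program directly from Python. The following is a minimal sketch, not part of the original script: probe_version is a hypothetical helper, and the executable names (tesseract, gswin64c, qpdf) are assumptions about how the tools are named on a Windows PATH. A program that cannot be started is reported as None; on Windows that failure is a FileNotFoundError carrying the same [WinError 2] text seen at import time.

```python
import subprocess


def probe_version(program, flag="--version"):
    """Return the first line `program` prints for `flag`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [program, flag],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some tools print their version on stderr
            universal_newlines=True,   # decode output as text (works on Python 3.7)
            timeout=30,
        )
    except OSError:
        # Covers FileNotFoundError; on Windows a missing executable shows up as
        # "[WinError 2] The system cannot find the file specified".
        return None
    except subprocess.TimeoutExpired:
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


# Assumed executable names; adjust to match the local installation.
for name in ("tesseract", "gswin64c", "qpdf"):
    print(name, "->", probe_version(name))
```

If tesseract comes back as None here, the interpreter cannot see it on PATH at all, which is a separate problem from the version-parsing error.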

Tesseract is installed here: C:\Program Files (x86)\Tesseract-OCR\, so I've tried the following environment variables (as user variables):

  • OCRMYPDF_TESSERACT = C:\Program Files (x86)\Tesseract-OCR\tesseract.exe
  • Added C:\Program Files (x86)\Tesseract-OCR to the end of Path
  • TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR\tessdata
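
To confirm that the settings above are actually visible to the Python process, a small check along these lines can help. It is a sketch under assumptions: locate and path_entries_containing are hypothetical helper names, and the program names are guesses at what is installed. Note that changes to user variables only reach processes started after the change, so the terminal or IDE has to be reopened first.

```python
import os
import shutil


def locate(names):
    """Map each program name to the full path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def path_entries_containing(fragment):
    """List the PATH entries that contain `fragment` (case-insensitive)."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    return [entry for entry in entries if fragment.lower() in entry.lower()]


# Assumed program names; gswin64c is the 64-bit Ghostscript console binary.
print(locate(["tesseract", "gswin64c", "qpdf"]))
print(path_entries_containing("Tesseract-OCR"))
print(os.environ.get("TESSDATA_PREFIX"))
```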

Any pointers or ideas would be much appreciated!

1 Answer

The repository was updated in response to the issue I opened: https://github.com/jbarlow83/OCRmyPDF/issues/795.

To install the updated version, use: pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git#egg=ocrmypdf

I still get [WinError 2] The system cannot find the file specified, but it works, so I'm not going to question it at this point.
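
For context on the original ValueError: the traceback ends in distutils/version.py, whose strict version class appears to accept only two or three numeric components, optionally followed by an a/b pre-release tag. The four-part string 4.0.0.20181030 reported by this Tesseract build does not fit that shape. The sketch below reproduces the check with re so it runs without distutils; the pattern is a reconstruction of StrictVersion.version_re, and is_strict_version and first_three_components are hypothetical helpers for illustration, not the change that was made in OCRmyPDF.

```python
import re

# Reconstruction of the pattern used by distutils.version.StrictVersion:
# major.minor[.patch][a|b + number], and nothing else.
STRICT_VERSION_RE = re.compile(
    r"^(\d+) \. (\d+) (\. (\d+))? ([ab](\d+))?$",
    re.VERBOSE | re.ASCII,
)


def is_strict_version(vstring):
    """Return True if `vstring` has the strict two- or three-component form."""
    return STRICT_VERSION_RE.match(vstring) is not None


def first_three_components(vstring):
    """Hypothetical workaround: drop everything after the third dotted component."""
    return ".".join(vstring.split(".")[:3])


print(is_strict_version("4.0.0.20181030"))       # False: four components
print(is_strict_version("4.0.0"))                # True
print(first_three_components("4.0.0.20181030"))  # 4.0.0
```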
