I'm using OCRmyPDF to extract text from scanned PDF files, using the code from this Colab notebook. The only difference is that instead of downloading the PDF from an online URL, I use a PDF stored on my local machine (I replaced {invoice_pdf} with {file_name}). Everything looks fine up to the point where I run:

os.system(f'ocrmypdf {file_name} output.pdf')

Instead of 0, this returns 512. On the next line, when I run !ocrmypdf Performance Evaluations.pdf output.pdf, I get an "unrecognized arguments" error message that reads:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
                [--output-type {pdfa,pdf,pdfa-1,pdfa-2}] [--sidecar [FILE]]
                [--version] [-j N] [-q] [-v [VERBOSE]] [--title TITLE]
                [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS]
                [-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
                [-f] [-s] [--skip-big MPixels] [--max-image-mpixels MPixels]
                [--tesseract-config CFG] [--tesseract-pagesegmode PSM]
                [--tesseract-oem MODE]
                [--pdf-renderer {auto,tesseract,hocr,sandwich}]
                [--tesseract-timeout SECONDS]
                [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}]
                [--user-words FILE] [--user-patterns FILE] [--skip-repair]
                [-k] [-g] [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf
ocrmypdf: error: unrecognized arguments: output.pdf

Finally, running the following line:

with pdfplumber.open('output.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text(x_tolerance=2)
    print(text)

returns

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-8274f7005856> in <module>()
----> 1 with pdfplumber.open('output.pdf') as pdf:
      2     page = pdf.pages[0]
      3     text = page.extract_text(x_tolerance=2)
      4     print(text)

/usr/local/lib/python3.6/dist-packages/pdfplumber/pdf.py in open(cls, path_or_fp, **kwargs)
     56     def open(cls, path_or_fp, **kwargs):
     57         if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58             fp = open(path_or_fp, "rb")
     59             inst = cls(fp, **kwargs)
     60             inst.close = fp.close

FileNotFoundError: [Errno 2] No such file or directory: 'output.pdf'

Any help is appreciated. Thanks

1 Answer

If the file name contains spaces, you need to enclose it in quotation marks so that the shell passes it to ocrmypdf as a single argument:

ocrmypdf "Performance Evaluations.pdf" output.pdf

or

ocrmypdf 'Performance Evaluations.pdf' output.pdf
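
When the name comes from a Python variable, as in the question, the quoting can be left to the standard library instead of being typed by hand. The following is only a minimal sketch, assuming a Unix-like environment such as Colab; run_ocr is a hypothetical helper name, not part of OCRmyPDF. It also shows why the original call returned 512: on Unix, os.system returns a wait status, and the exit code is stored in the high byte.

```python
import shlex
import subprocess

file_name = "Performance Evaluations.pdf"

# Option 1: let shlex.quote add the quotes before building a shell string.
shell_command = f"ocrmypdf {shlex.quote(file_name)} output.pdf"
print(shell_command)

# Option 2: skip the shell entirely and pass an argument list.
# Each list item reaches ocrmypdf as exactly one argument, spaces included.
argv = ["ocrmypdf", file_name, "output.pdf"]


def run_ocr(arguments):
    """Hypothetical helper: run the command and return its exit code."""
    return subprocess.run(arguments).returncode


# On Unix, os.system returns a wait status rather than the exit code.
# 512 >> 8 is 2, the code argparse-based tools use for usage errors.
exit_code = 512 >> 8
print(exit_code)
```

Calling run_ocr(argv) would then replace the os.system line. In a notebook cell, the equivalent of the quoted command above should be !ocrmypdf "{file_name}" output.pdf, since IPython substitutes {file_name} before handing the line to the shell.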

6 Comments

Hi Alex, thanks for your response. The reason I created the file_name variable is that I'll be running ocrmypdf over and over to convert a couple hundred scanned PDFs into text, and I don't want to change the file name by hand every time. But as you noted, whenever I use the variable instead of the actual file name I run into the error. Can you suggest a way around this problem?
Try this: os.system(f'ocrmypdf \'{file_name}\' output.pdf')
That worked and now I get 0 for that line, but the next line (!ocrmypdf {file_name} output.pdf) still gives the same error (ocrmypdf: error: unrecognized arguments: output.pdf)
Double quotes? os.system (f'ocrmypdf \ "{file_name} \" output.pdf ')
Thanks. Double quotes didn't work. I'm using single quotes to do the job.
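
For the batch use case mentioned in the comments (a couple hundred scanned PDFs), one possible approach is to build one argument list per file, so that no shell quoting is involved at all. This is a rough sketch, not a tested pipeline: build_commands, run_commands, the folder names and the _ocr suffix are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_dir):
    """Return one ocrmypdf argument list per PDF found in input_dir."""
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Hypothetical naming scheme: "<original name>_ocr.pdf" in output_dir.
        out = Path(output_dir) / f"{pdf.stem}_ocr.pdf"
        commands.append(["ocrmypdf", str(pdf), str(out)])
    return commands


def run_commands(commands):
    """Run each command and return the input files whose exit code was not 0."""
    failed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            failed.append(command[1])
    return failed


if __name__ == "__main__":
    # "scans" and "ocr_output" are placeholder folder names.
    Path("ocr_output").mkdir(exist_ok=True)
    failures = run_commands(build_commands("scans", "ocr_output"))
    print(f"{len(failures)} file(s) failed")
```

Writing the results to a separate folder keeps a second run from picking up files that were already processed.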