No output for OCRmyPDF

Question

I'm using OCRmyPDF to extract text from scanned PDF files. I use the code from this Colab notebook for that purpose. The only difference is that instead of downloading the PDF from an online URL, I use a PDF stored on my local machine (I replaced {invoice_pdf} with {file_name}). Everything looks fine up to the point where I run:

os.system(f'ocrmypdf {file_name} output.pdf')

Instead of 0, I get 512. On the next line, when I run !ocrmypdf Performance Evaluations.pdf output.pdf, I get an "unrecognized arguments" error message that reads:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
                [--output-type {pdfa,pdf,pdfa-1,pdfa-2}] [--sidecar [FILE]]
                [--version] [-j N] [-q] [-v [VERBOSE]] [--title TITLE]
                [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS]
                [-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
                [-f] [-s] [--skip-big MPixels] [--max-image-mpixels MPixels]
                [--tesseract-config CFG] [--tesseract-pagesegmode PSM]
                [--tesseract-oem MODE]
                [--pdf-renderer {auto,tesseract,hocr,sandwich}]
                [--tesseract-timeout SECONDS]
                [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}]
                [--user-words FILE] [--user-patterns FILE] [--skip-repair]
                [-k] [-g] [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf
ocrmypdf: error: unrecognized arguments: output.pdf

Finally, running the following line:

with pdfplumber.open('output.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text(x_tolerance=2)
    print(text)

returns

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-19-8274f7005856> in <module>()
----> 1 with pdfplumber.open('output.pdf') as pdf:
      2     page = pdf.pages[0]
      3     text = page.extract_text(x_tolerance=2)
      4     print(text)

/usr/local/lib/python3.6/dist-packages/pdfplumber/pdf.py in open(cls, path_or_fp, **kwargs)
     56     def open(cls, path_or_fp, **kwargs):
     57         if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58             fp = open(path_or_fp, "rb")
     59             inst = cls(fp, **kwargs)
     60             inst.close = fp.close

FileNotFoundError: [Errno 2] No such file or directory: 'output.pdf'

Any help is appreciated. Thanks

Alex Alex · Accepted Answer · 2021-01-05 08:18:52Z


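If the file name contains spaces, you need to enclose it in quotation marks. Otherwise the shell splits Performance Evaluations.pdf into two separate arguments, which leaves output.pdf as an extra argument that ocrmypdf does not recognize.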

ocrmypdf "Performance Evaluations.pdf" output.pdf

or

ocrmypdf 'Performance Evaluations.pdf' output.pdf
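
When the name lives in a Python variable, the quoting can be left to the standard library instead of being typed by hand. The following is a minimal sketch, not part of the original notebook: build_shell_command and run_ocr are hypothetical helper names, and it assumes ocrmypdf is installed and on the PATH. shlex.quote produces a string that is safe to hand to os.system, while subprocess.run with a list of arguments skips the shell entirely, so spaces need no quoting at all.

```python
import shlex
import subprocess


def build_shell_command(file_name, output_name="output.pdf"):
    """Return a shell command string with both paths safely quoted."""
    return f"ocrmypdf {shlex.quote(file_name)} {shlex.quote(output_name)}"


def run_ocr(file_name, output_name="output.pdf"):
    """Run ocrmypdf without a shell; each list item is exactly one argument."""
    result = subprocess.run(["ocrmypdf", file_name, output_name])
    return result.returncode  # plain exit code, 0 on success


# The quoted string is what os.system() would need to receive.
print(build_shell_command("Performance Evaluations.pdf"))
# prints: ocrmypdf 'Performance Evaluations.pdf' output.pdf
```

Calling run_ocr(file_name) should return 0 when the conversion succeeds. As a side note, on Linux os.system returns a wait status rather than the exit code itself, with the exit code in the high byte, so the 512 in the question corresponds to exit code 2 (512 >> 8), which is the code a command-line usage error typically produces.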


6 Comments

Zia Over a year ago

Hi Alex, thanks for your response. The reason I created the file_name variable is that I'll be running ocrmypdf over and over to convert a couple of hundred scanned PDFs into text, and I don't want to change the file name by hand every time. But as you noted, every time I use the variable instead of the actual file name, I run into the error. Can you suggest a way around this problem?

Alex Alex Over a year ago

Try this: os.system(f'ocrmypdf \'{file_name}\' output.pdf')

Zia Over a year ago

That worked, and now I get 0 for that line, but the next line (!ocrmypdf {file_name} output.pdf) still gives the same error (ocrmypdf: error: unrecognized arguments: output.pdf).

Alex Alex Over a year ago

Double quotes? os.system (f'ocrmypdf \ "{file_name} \" output.pdf ')

Zia Over a year ago

Thanks. Double quotes didn't work. I'm using single quotes to do the job.
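
For the batch case raised in these comments (a couple of hundred scanned PDFs), a loop over a folder avoids editing the file name between runs. This is a rough sketch under stated assumptions: ocr_folder is a hypothetical helper, the folder names in the usage note are placeholders, and ocrmypdf is assumed to be on the PATH. The runner parameter exists only so the loop can be tried without actually invoking ocrmypdf.

```python
import subprocess
from pathlib import Path


def ocr_folder(input_dir, output_dir, runner=subprocess.run):
    """OCR every *.pdf in input_dir, writing same-named files to output_dir.

    Returns the names of the input files for which the command failed.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = out_dir / src.name
        # List form: spaces in src or dst need no quoting.
        result = runner(["ocrmypdf", str(src), str(dst)])
        if result.returncode != 0:
            failed.append(src.name)
    return failed
```

A call such as failed = ocr_folder("scans", "ocr_output") would then process the whole folder and report which files did not convert. The notebook line that still fails, !ocrmypdf {file_name} output.pdf, most likely has the same cause: the variable is substituted before the shell parses the command, so the spaces split the name again, and writing it as !ocrmypdf "{file_name}" output.pdf should behave like the quoted os.system call.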
