Extracting scanned pages from PDF using python

Question

I have a lot of PDF files that are scanned documents, so every page is a single scanned image. I want to run OCR on those files and extract the text. I tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them to a directory, and then run pytesseract on those images. Is there a way in Python to extract the scanned images from PDF files? Or is there a way to perform OCR directly on PDF files?

pragmaticprog · Accepted Answer · 2018-05-26 20:56:26Z

3

This question has been addressed in previous Stack Overflow posts:

Converting PDF to images automatically
Converting a PDF to a series of images with Python

Here is a script that may be helpful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Another method: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick

Please check previous posts before asking a question.

EDIT:

Including a working script for future reference. The program works with Python 3.6 on Windows:

# coding=utf-8
# Extract JPEGs from a PDF by scanning its raw bytes. Quick and dirty.

with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

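startmark = b"\xff\xd8"  # JPEG start-of-image (SOI) marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image (EOI) marker
endfix = 2  # include the two EOI bytes in the slice
i = 0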

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
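
The same scan can also be wrapped in functions, which makes it easier to run over many PDFs and to hand the results to pytesseract, as the question asks. This is only a sketch under some assumptions: the names extract_jpegs, save_jpegs and ocr_jpeg_files are made up for illustration, and the OCR step assumes that pytesseract, Pillow and the Tesseract binary are installed. Like the script above, it only finds scans stored as raw JPEG streams; pages stored with other encodings will not be picked up.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # start-of-image marker
JPEG_END = b"\xff\xd9"    # end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the raw bytes of every JPEG stream found in pdf_bytes."""
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # A JPEG stream begins with the SOI marker just after the keyword.
        start = pdf_bytes.find(JPEG_START, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + 20
            continue
        end_stream = pdf_bytes.find(b"endstream", start)
        if end_stream < 0:
            raise ValueError("Didn't find end of stream!")
        # Look for the EOI marker near the end of the stream; the max()
        # keeps the search offset from going negative on tiny inputs.
        end = pdf_bytes.find(JPEG_END, max(end_stream - 20, 0))
        if end < 0:
            raise ValueError("Didn't find end of JPG!")
        end += len(JPEG_END)
        images.append(pdf_bytes[start:end])
        pos = end
    return images


def save_jpegs(images, out_dir):
    """Write each image to out_dir as jpg0.jpg, jpg1.jpg, ... and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, data in enumerate(images):
        path = out_dir / ("jpg%d.jpg" % n)
        path.write_bytes(data)
        paths.append(path)
    return paths


def ocr_jpeg_files(paths, lang="eng"):
    """OCR each saved image. Assumes pytesseract, Pillow and Tesseract are installed."""
    import pytesseract
    from PIL import Image

    texts = {}
    for path in paths:
        with Image.open(path) as image:
            texts[str(path)] = pytesseract.image_to_string(image, lang=lang)
    return texts
```

Typical use would be something like paths = save_jpegs(extract_jpegs(Path("Link/To/PDF/File.pdf").read_bytes()), "out") followed by ocr_jpeg_files(paths). Keeping the extraction separate from the file writing means the byte scan can be checked on its own, without touching the disk or needing Tesseract.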

edited May 26, 2018 at 20:56

answered May 26, 2018 at 16:19


5 Comments

Haroon S. Over a year ago

I couldn't find any method that is working with Python 3.6. I am using Anaconda on Windows.

pragmaticprog Over a year ago

I just ran the code from the comments section of the example script I linked to (nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html). I was able to get it working on my Windows Machine running Python 3.6. Let me know if you are still having issues.

Haroon S. Over a year ago

Thank you for your effort. Yes, this one is working fine. Upvoting.

pragmaticprog Over a year ago

Sweet! Glad I could help!

Jack Griffin Over a year ago

Works on Kubuntu 22.04 running Python 3.10.7. Thanks.
