4

I am using pdfminer to extract data from PDF files in Python. I would like to extract all the data in a PDF, whether it is an image, text, or anything else. Can we do that in a single line (or two if needed, without much work)? Any help is appreciated. Thanks in advance.

3 Answers

7

Can we do that in a single line (or two if needed, without much work)?

No, you cannot. Pdfminer is powerful but it's rather low-level.

Unfortunately, the documentation is not exactly exhaustive. I was able to find my way around it thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

See also this answer, where I give a little more detail.

2 Comments

Thanks for your quick reply. I am working on extracting HTML from a PDF so that I can convert it to EPUB. I am able to get the HTML, but the images in the PDF are missing. Can you suggest a way to extract them as well and save them in a folder (if required)?
Also check the documentation for the command-line tool pdf2txt.py, which comes with PDFMiner. It says there that it can extract embedded JPG images (but only JPG; Denis's code handles multiple types).
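
For anyone landing here with the same image question: the sketch below assumes the pdfminer.six fork, whose `extract_text_to_fp` helper takes an `output_dir` argument for writing embedded images to disk while it extracts text. The function name `pdf_to_text_and_images` and the folder layout are made up for illustration, and which image formats actually get saved depends on the PDF and the installed version, so treat this as a starting point rather than a guaranteed solution.

```python
from io import StringIO
from pathlib import Path


def pdf_to_text_and_images(pdf_path, image_dir):
    """Return the text of a PDF and dump its embedded images into image_dir.

    Hypothetical helper: assumes pdfminer.six is installed.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported here so the file check above works even without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    image_dir = Path(image_dir)
    image_dir.mkdir(parents=True, exist_ok=True)

    out = StringIO()
    with open(pdf_path, "rb") as fp:
        # output_dir asks pdfminer.six to export the images it encounters.
        extract_text_to_fp(fp, out, laparams=LAParams(),
                           output_dir=str(image_dir))
    return out.getvalue()
```

Calling `pdf_to_text_and_images("book.pdf", "images")` would return the text and leave any exported images in the `images` folder.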
1

For Python 3:

pip install pdfminer.six

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path, codec='utf-8'):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0       # 0 means no page limit
    caching = True
    pagenos = set()    # an empty set means all pages

    # Open the file in a with-block so it is closed even if parsing fails.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text
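
If only the text is needed, recent pdfminer.six releases also ship a high-level helper, `pdfminer.high_level.extract_text`, that gets close to the "one or two lines" the question asks for. The sketch below wraps it; the wrapper name `pdf_to_text` is hypothetical, and it assumes a pdfminer.six version that includes the `high_level` module.

```python
from pathlib import Path


def pdf_to_text(path, page_numbers=None):
    """Return the text of a PDF using pdfminer.six's high-level helper.

    Hypothetical wrapper: assumes pdfminer.six is installed.
    """
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")

    # Imported here so the file check above works even without pdfminer.six.
    from pdfminer.high_level import extract_text

    # page_numbers is a list of zero-based page indexes; None means every page.
    return extract_text(str(pdf_path), page_numbers=page_numbers)
```

For example, `pdf_to_text("a.pdf", page_numbers=[0])` would return the text of the first page only. This covers text; images still need separate handling.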

-1

For Python 3, there is another option: pip install pdfminer3k

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
import time
from functools import wraps

def fn_timer(function):  # decorator that reports the wrapped function's run time
    @wraps(function)
    def function_timer(*args, **kwargs):
        t0 = time.time()
        result = function(*args, **kwargs)
        t1 = time.time()
        print("Total time running %s: %s seconds" %
              (function.__name__, str(t1 - t0)))
        return result
    return function_timer

@fn_timer
def convert_pdf(path, pages):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp,pages)
    fp.close()
    device.close()

    text = retstr.getvalue()  # renamed from str to avoid shadowing the builtin
    retstr.close()
    return text

file = r'M:\a.pdf'

print(convert_pdf(file,[1,]))
