extracting text from a pdf in Python [closed]

Question


I am trying to extract text from a PDF.

import PyPDF2

def getPDFContent(path):
    p = open(path, "rb")
    print(p)
    content = ""
    pdf_content = PyPDF2.PdfFileReader(p)
    print(pdf_content)
    pages = pdf_content.numPages
    print(pages)
    for i in range(0, pages):
        content += pdf_content.getPage(i).extractText() + "\n"
        #print(content)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

getPDFContent(path_to_sample)

The output I get is:

How can that be fixed?

Sandeep_Rao · Accepted Answer · 2019-07-15 06:11:53Z

Your first mistake is that you never assign the function call to a variable, so the processed text it returns is discarded.

x = getPDFContent(path_to_sample)
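To illustrate the point: in a script, a return value that is neither assigned nor printed is simply dropped (only the interactive interpreter echoes it). The sketch below uses a hypothetical stand-in, clean_text, which applies the same whitespace cleanup as the last step of getPDFContent in the question. The sample string is made up for the demonstration.

```python
def clean_text(raw):
    """Collapse whitespace the same way getPDFContent does before returning."""
    return " ".join(raw.replace("\xa0", " ").strip().split())


# Made-up input standing in for text extracted from a PDF.
sample = "  Hello\xa0 PDF \n world  "

clean_text(sample)           # the result is computed, then thrown away
result = clean_text(sample)  # the result is kept in a variable
print(result)                # and only now becomes visible
```

The same applies to getPDFContent: assign its result, then print or otherwise use it.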

If that still doesn't fix the problem, try the PDFMiner module (pdfminer.six for Python 3). PyPDF2 can be problematic depending on the Python version and on the PDF itself. I ran into issues with PyPDF2 on certain PDF files, which gave me output similar to yours. PDFMiner, however, has worked consistently for me on Python 3.x with the code below.

Install it with the command pip install pdfminer.six (the fork that added Python 3 support), then use the following code and you should be good to go.

    import io

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def getPDFContent(path, pages=None):
        # An empty set tells PDFPage.get_pages to process every page.
        pagenums = set(pages) if pages else set()
        output = io.StringIO()
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(manager, converter)
        # The context manager closes the file even if a page fails to parse.
        with open(path, "rb") as infile:
            for page in PDFPage.get_pages(infile, pagenums):
                interpreter.process_page(page)
        converter.close()
        text = output.getvalue()
        output.close()
        return text


    x = getPDFContent(path_to_sample)
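More recent pdfminer.six releases also provide a high-level helper, pdfminer.high_level.extract_text, which wraps the resource manager, converter, and interpreter steps in one call. The following is only a sketch, assuming an installed pdfminer.six version that includes that helper. The names extract_pdf_text and collapse_whitespace are made up for this example, and the whitespace cleanup mirrors what the question's code does before returning.

```python
def collapse_whitespace(text):
    """Collapse runs of whitespace (including non-breaking spaces) to single spaces."""
    return " ".join(text.split())


def extract_pdf_text(path, pages=None):
    """Return the text of a PDF as one whitespace-normalized string.

    pages is an optional iterable of zero-based page indexes; None means all pages.
    """
    # Imported here so collapse_whitespace stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    page_numbers = set(pages) if pages else None
    return collapse_whitespace(extract_text(path, page_numbers=page_numbers))
```

Usage would look like x = extract_pdf_text(path_to_sample), with path_to_sample being the same path variable as in the question.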
