I am trying to extract text from a PDF.

import PyPDF2

def getPDFContent(path):
    content = ""
    # Open in binary mode; the with-block closes the file when done
    with open(path, "rb") as p:
        print(p)
        pdf_content = PyPDF2.PdfFileReader(p)
        print(pdf_content)
        pages = pdf_content.numPages
        print(pages)
        for i in range(pages):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

getPDFContent(path_to_sample)

The output I get is:

[screenshot of the output; image not included]

How can that be fixed?

1 Answer

Your first mistake is not assigning the result of the function call to a variable. The function returns the processed text, so capture it:

x=getPDFContent(path_to_sample)
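
As a minimal illustration of the difference, the sketch below uses a stand-in function (clean_text is a hypothetical name, not part of PyPDF2) that applies the same whitespace cleanup as the last step of the function in the question. When run as a script, a bare call discards the returned string, while an assignment keeps it for printing or further processing.

```python
def clean_text(raw):
    # Same normalization as in the question: replace non-breaking spaces,
    # then collapse every run of whitespace into a single space.
    return " ".join(raw.replace("\xa0", " ").strip().split())


# Made-up sample text standing in for what a PDF extractor might return
sample = "  Hello\xa0world \n\n from   a PDF  "

clean_text(sample)         # the result is computed, then thrown away
text = clean_text(sample)  # the result is kept in a variable
print(text)                # Hello world from a PDF
```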

If that still doesn't fix the problem, try the PDFMiner module (pdfminer.six for Python 3). PyPDF2 can be problematic depending on the Python version and on the PDF itself; I ran into issues with certain PDF files that gave me output similar to yours. PDFMiner, however, has worked consistently for me with the following code on Python 3.x.

Install PDFMiner with the command pip install pdfminer.six (for Python 2 and 3 compatibility), then use the code below and you should be good to go.

    import io

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def getPDFContent(path, pages=None):
        # An empty set tells PDFPage.get_pages to process every page
        pagenums = set(pages) if pages else set()
        output = io.StringIO()
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        interpreter = PDFPageInterpreter(manager, converter)
        # The with-block closes the PDF file even if a page fails to parse
        with open(path, "rb") as infile:
            for page in PDFPage.get_pages(infile, pagenums):
                interpreter.process_page(page)
        converter.close()
        text = output.getvalue()
        output.close()
        return text


    x = getPDFContent(path_to_sample)
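
One detail worth knowing about the optional pages argument: as far as I know, PDFPage.get_pages treats the page numbers it receives as zero-based, so the first page is 0. If you prefer to think in the 1-based numbers a PDF viewer shows, a small helper can do the conversion. This is only a sketch; to_zero_based is a hypothetical name, not part of pdfminer.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers to a set of zero-based indices."""
    if not pages:
        # An empty set is what getPDFContent uses to mean "all pages"
        return set()
    # Drop anything below 1, since there is no page 0 in viewer numbering
    return {p - 1 for p in pages if p >= 1}


print(sorted(to_zero_based([1, 3, 3])))  # [0, 2]
```

The result can then be passed along, for example getPDFContent(path_to_sample, pages=to_zero_based([1, 3])).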