How can I extract text from a pdf using Python? [duplicate]

Question

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of the PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # create a resource manager
            resource_manager = PDFResourceManager()

            # create an in-memory text buffer to receive the output
            fake_file_handle = StringIO()

            # create a text converter object
            converter = TextConverter(
                resource_manager,
                fake_file_handle,
                codec='utf-8',
                laparams=LAParams()
            )

            # create a page interpreter
            page_interpreter = PDFPageInterpreter(resource_manager, converter)

            # process the current page
            page_interpreter.process_page(page)

            # extract the text
            text = fake_file_handle.getvalue()

            # close open handles before handing the text back
            converter.close()
            fake_file_handle.close()

            yield text


text = ''
for page in extract_pdf('Path of the PDF Document'):
    text += page

With this code I was able to extract text from many PDF documents. But when I tested it on other random PDFs from the internet, the results became inconsistent, and for some files no text was extracted at all. When I checked the type of text, it was still <class 'str'>.

Can someone point out any errors I overlooked while writing this code?

This might be helpful

– Lord Elrond, Commented Sep 25, 2019 at 10:56

Rangoli Thakur · Accepted Answer · 2019-09-25 10:54:46Z

0

import PyPDF2

# Current PyPDF2 names; older releases spelled these PdfFileReader, numPages, getPage() and extractText()
with open('example.pdf', 'rb') as fh:
    reader = PyPDF2.PdfReader(fh)
    for page in reader.pages:
        print(page.extract_text())
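
If some PDFs still come back with no text, it may help to find out which pages are empty before blaming the code. The sketch below is only a diagnostic aid, not a fix: find_empty_pages is a hypothetical helper name, and it assumes you already have one string per page from whichever extractor you use. One guess, which the question does not confirm, is that pages with no text are scanned images without a text layer.

```python
def find_empty_pages(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # treat None the same as an empty string
        if not (text or "").strip()
    ]


# Stand-in strings for illustration; in practice pass the per-page
# results of your extractor instead.
sample = ["First page text", "", "   \n", "Last page"]
print(find_empty_pages(sample))  # [2, 3]
```

If every page of a failing file shows up in that list, the problem is more likely the document than the extraction loop.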

