from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_pdf(pdf_path):
    with open(pdf_path, 'rb') as fh:
        # iterate over all pages of the PDF document
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            # create a resource manager
            resource_manager = PDFResourceManager()

            # create an in-memory text buffer
            fake_file_handle = StringIO()

            # create a text converter object
            converter = TextConverter(
                resource_manager,
                fake_file_handle,
                codec='utf-8',
                laparams=LAParams(),
            )

            # create a page interpreter
            page_interpreter = PDFPageInterpreter(resource_manager, converter)

            # process the current page
            page_interpreter.process_page(page)

            # read the extracted text, then close the handles before yielding
            text = fake_file_handle.getvalue()
            converter.close()
            fake_file_handle.close()
            yield text


text = ''
for page in extract_pdf('Path of the PDF Document'):
    text += page

With this code I was able to extract text from many PDF documents. But when I tested it on other random PDFs from the internet, the results became inconsistent, and for some files no extracted text appears in the output at all. When I checked the type of text, it was still <class 'str'>.

Can someone point out any errors I overlooked while writing this code?

  • This might be helpful Commented Sep 25, 2019 at 10:56

1 Answer

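# Newer PyPDF2 releases; older ones used PdfFileReader, numPages, getPage and extractText
from PyPDF2 import PdfReader

with open('example.pdf', 'rb') as fh:
    reader = PdfReader(fh)
    for page in reader.pages:
        # print the text extracted from each page
        print(page.extract_text())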
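
Whichever library you use, a result of type str can still be an empty string, so checking the type does not tell you whether anything was extracted. One common cause is a PDF made of scanned images, which has no text layer to read. As a rough diagnostic, the sketch below reports which pages came back blank. The helper name blank_pages and the sample list are made up for illustration; in practice you would pass in the per-page strings produced by your own extractor.

```python
def blank_pages(page_texts):
    """Return the 0-based indices of pages whose text is empty or only whitespace."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Stand-in data for illustration; replace with the strings your extractor yields.
sample = ["First page text\n", "", " \n\x0c", "Last page"]
print(blank_pages(sample))
```

If every page of a file is reported as blank, the document is probably image-only and would need OCR rather than a text extractor.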