How to return all extracted text from multiple PDFs in Python?

Question

This is my code. So far, it prints all the content of the PDFs through the pages variable. However, I cannot get the function to return that same extracted text. I've been testing it by placing assorted PDFs in the folder I pass to the function. How do I get it to return the extracted text the same way it prints it?

import os
import PyPDF2
import pandas as pd

def scan_files(root):
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                for p in range(0, numPages):
                        pages = ''
                        page = pdf.getPage(p)
                        pages += page.extractText()
                        pages = pages.replace('\n', '')
                        #print(pages)
                        return pages

Because the return statement sits inside the for loops, the function stops at the first return it reaches, so it probably hands back only the first page, if I am correct. Can you describe the output you are getting?
– iustin, Commented Jul 19, 2020 at 15:13
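
To make the comment above concrete, here is a minimal sketch of the same control flow without any PDF library. The function names first_only and all_chunks and the sample list are hypothetical and exist only for illustration; the list of strings stands in for the pages of a PDF.

```python
def first_only(chunks):
    """Mimics the question's structure: return sits inside the loop."""
    for chunk in chunks:
        text = ''          # reset on every iteration, as in the question
        text += chunk
        return text        # exits the function during the first iteration
    return ''              # only reached when chunks is empty


def all_chunks(chunks):
    """Accumulates across iterations and returns once, after the loop."""
    text = ''
    for chunk in chunks:
        text += chunk
    return text


pages = ['page one ', 'page two ', 'page three']
print(first_only(pages))   # page one
print(all_chunks(pages))   # page one page two page three
```

A print call inside the loop runs on every iteration, which is why printing appeared to work while returning did not.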

iustin · Accepted Answer · 2020-07-19 15:28:33Z

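Printing inside the innermost loop (the "print(pages)" you mentioned) lets every loop run to completion. A return statement, by contrast, exits the function immediately, so you only get back the text gathered up to that point, which here is the first page of the first PDF. Your version also resets pages to an empty string on every iteration. Try accumulating the text and returning it once, after all the loops have finished: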

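import os
import PyPDF2

# Note: PdfFileReader, getNumPages, getPage and extractText are the
# PyPDF2 1.x/2.x names; PyPDF2 3.0 replaced them with PdfReader,
# reader.pages and extract_text().
def scan_files(root):
    pdftext = ''  # text accumulated across every PDF found
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                pdf = PyPDF2.PdfFileReader(os.path.join(path, name))
                numPages = pdf.getNumPages()

                pages = ''  # text of the current PDF, reset once per file

                for p in range(numPages):
                    page = pdf.getPage(p)
                    pages += page.extractText()

                # Strip newlines once per file rather than after every page
                pdftext += pages.replace('\n', '')

    # Return only after every file has been processed
    return pdftext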
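
If you also need to know which text came from which file, one possible variation returns a dictionary keyed by file path instead of a single string. This is only a sketch under assumptions: scan_files_by_path is a hypothetical name, and extract_text is a hypothetical callable you would supply yourself (for example, a small wrapper around the page loop above) that takes a path and returns that file's text.

```python
import os


def scan_files_by_path(root, extract_text):
    """Walk root and return {pdf_path: text} using the supplied extractor.

    extract_text is a hypothetical callable: it receives a file path and
    returns that file's text, e.g. a wrapper around the PyPDF2 page loop.
    """
    texts = {}
    for path, subdirs, files in os.walk(root):
        for name in files:
            # Assumption: match the extension case-insensitively, which is
            # slightly broader than the endswith('.pdf') check above.
            if name.lower().endswith('.pdf'):
                full_path = os.path.join(path, name)
                texts[full_path] = extract_text(full_path)
    return texts  # returned once, after every file has been visited
```

Joining the dictionary's values with ''.join(...) gives back a single string like the one scan_files returns, while the keys keep track of where each piece came from.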

answered Jul 19, 2020 at 15:28

iustin
