0

This is my code. So far, it'll print all the content of the pdfs to the pages variable. However, I cannot seem to return the same extracted text. I've been testing it by pulling information from random pdfs and placing it in the folder I'm calling. How do I get it to return the extracted text the same way it prints it?

import os
import PyPDF2 as pdf
import pandas as pd

def scan_files(root):
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                for p in range(0, numPages):
                        pages = ''
                        page = pdf.getPage(p)
                        pages += page.extractText()
                        pages = pages.replace('\n', '')
                        #print(pages)
                        return pages
3
  • The function you are calling by having for loops will stop at the given return statement, it may only return you the first page if I am correct? Describe the output you are getting Commented Jul 19, 2020 at 15:13
  • Where it is right now, yes it only prints the first page. Commented Jul 20, 2020 at 12:33
  • check the answer i gave below Commented Jul 20, 2020 at 12:34

1 Answer 1

0

Printing the text will allow the last for loop to iterate(using the "print(pages)" you mentioned). However, returning pages will terminate the loops running and will spit out the text it covered so far. Try using something like:

def scan_files(root):
    pdftext = ''
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                
                pages = ''                    

                for p in range(0, numPages):
                    page = pdf.getPage(p)
                    pages += page.extractText()
                    pages = pages.replace('\n', '')

                pdftext += pages

    return pdftext
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.