0

I am trying to read some text from a PDF file. I am using the code below; however, when I try to get the text (ptext), all that is returned is a string variable of size 1, and it is empty.

Why is no text being returned? I have tried other pages and another PDF book, but I get the same result: I can't seem to read any text.

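import PyPDF2

# Open in binary mode; the with-block closes the file when done
with open(r'C:/Users/pdfs/test_file.pdf', 'rb') as file:
    fileReader = PyPDF2.PdfFileReader(file)

    # Page indices are zero-based, so 445 is the 446th page
    pageObj = fileReader.getPage(445)
    ptext = pageObj.extractText()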
7
  • 1
    Do these PDFs contain text? Check that – do not say "of course, because I can see it!" Commented Feb 22, 2020 at 11:35
  • 1
    From the extractText docs: "This works well for some PDF files, but poorly for others, depending on the generator used.". I've never had any success with PyPDF2 (especially with PDFs generated from MS Office). Try the alternatives here: How to extract text from a PDF file?. Commented Feb 22, 2020 at 11:36
  • @usr2564301 stupid question here but how do I know if it contains text? I mean I can see words but guess that could be a scanned image? Commented Feb 22, 2020 at 11:43
  • 2
    (1) Open with a canonical PDF reader such as Adobe's own. (2) Select text – if there is no text this step will fail. (3) Copy, paste into a text editor. If the text cannot be decoded, you get nothing or garbage. Commented Feb 22, 2020 at 11:44
  • Have a look at pdfreader Commented Feb 23, 2020 at 2:26
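
The manual check described in the comments can also be approximated in code. The following is a minimal sketch, assuming the same legacy PyPDF2 API the question uses (PdfFileReader, numPages, getPage, extractText). The names has_text_layer and sample_page_texts, and the min_chars threshold of 20, are hypothetical choices made for illustration.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any sampled page yielded a meaningful amount of text.

    min_chars is an arbitrary threshold (an assumption, not a PyPDF2 setting).
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def sample_page_texts(path, max_pages=5):
    """Extract text from the first few pages using the legacy PyPDF2 API."""
    import PyPDF2  # imported lazily so has_text_layer works without it

    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        count = min(reader.numPages, max_pages)
        return [reader.getPage(i).extractText() for i in range(count)]
```

Calling has_text_layer(sample_page_texts(path)) and getting False suggests either a scanned PDF or text that PyPDF2 cannot decode; it does not tell the two apart, so the copy-and-paste test in a PDF reader is still worth doing.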

2 Answers

1

I had the same issue and thought something was wrong with my code. After some research and debugging, it seems that the PyPDF2, PyPDF3, and PyPDF4 packages can't handle large files: a 20-page PDF ran seamlessly, but a 50+ page PDF made PyPDF crash.

My only suggestion would be to use a different package altogether. pdftotext is a good recommendation; install it with pip install pdftotext.
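
For reference, reading a PDF with pdftotext might look like the sketch below. It assumes the pdftotext package is installed (it also needs the Poppler libraries on the system). The helper names read_pages and get_page are hypothetical; only pdftotext.PDF comes from the library.

```python
def read_pages(path):
    """Return a list with the text of every page in the PDF."""
    import pdftotext  # third-party; imported lazily so get_page works without it

    with open(path, 'rb') as f:
        pdf = pdftotext.PDF(f)  # iterable of page strings
        return list(pdf)


def get_page(pages, page_number):
    """Return the text of a page, using 1-based page numbers."""
    if not 1 <= page_number <= len(pages):
        raise IndexError("page %d is outside 1..%d" % (page_number, len(pages)))
    return pages[page_number - 1]
```

With these helpers, the page from the question (zero-based index 445) would be get_page(read_pages(path), 446).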

0

I faced a similar issue while reading my PDF files; I hope the solution below helps. The cause in my case: the PDF I was selecting was actually a scanned image. I had created my resume using a third-party site, which returned a PDF, and I was not able to extract text directly from that type of file. Running OCR on the rendered pages worked instead.

Below is the tested, working code:

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os

def readPdfFile(filePath):
    # Render every PDF page to an image at 500 DPI
    pages = convert_from_path(filePath, 500)
    filenames = []

    # Part 1: Save each page as a temporary JPEG
    for number, page in enumerate(pages, start=1):
        filename = "page_" + str(number) + ".jpg"
        page.save(filename, 'JPEG')
        filenames.append(filename)

    # Part 2: Recognize text in each image using OCR.
    # Accumulate the text so that every page is returned, not just the last one.
    text = ""
    for filename in filenames:
        with Image.open(filename) as image:
            page_text = pytesseract.image_to_string(image)
        # Rejoin words that were hyphenated across line breaks
        text += page_text.replace('-\n', '')

    # Part 3: Remove the temporary files
    for filename in filenames:
        os.remove(filename)
    return text
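
Since OCR is slow compared with reading an existing text layer, one option is to try direct extraction first and fall back to OCR only when almost nothing comes back. The sketch below illustrates that idea; extract_with_fallback, its two callable parameters, and the min_chars threshold are all hypothetical and not part of any library.

```python
def extract_with_fallback(path, extract_text, run_ocr, min_chars=20):
    """Try direct text extraction; use OCR if too little text is returned.

    extract_text and run_ocr are callables that take a file path and
    return a string. Returns (text, method_used).
    """
    text = extract_text(path)
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    return run_ocr(path), "ocr"
```

Here run_ocr could be the readPdfFile function above, and extract_text any function that returns the text from a library such as PyPDF2 or pdftotext.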
