can't read pdf document using PyPDF2

Question

I am trying to read some text from a PDF file. I am using the code below; however, when I try to get the text (ptext), all that is returned is a string variable of size 1, and it is empty.

Why is no text being returned? I have tried other pages and another PDF book, but I get the same result: I can't seem to read any text.

import PyPDF2

file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)

pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()

Do these PDFs contain text? Check that – do not say "of course, because I can see it!"
– Jongware, Commented Feb 22, 2020 at 11:35
From the extractText docs: "This works well for some PDF files, but poorly for others, depending on the generator used." I've never had any success with PyPDF2 (especially with PDFs generated from MS Office). Try the alternatives here: How to extract text from a PDF file?
– Gino Mempin, Commented Feb 22, 2020 at 11:36
@usr2564301 Stupid question here, but how do I know if it contains text? I mean, I can see words, but I guess that could be a scanned image?
– mHelpMe, Commented Feb 22, 2020 at 11:43
(1) Open with a canonical PDF reader such as Adobe's own. (2) Select text – if there is no text this step will fail. (3) Copy, paste into a text editor. If the text cannot be decoded, you get nothing or garbage.
– Jongware, Commented Feb 22, 2020 at 11:44
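
The manual check described in the comments can also be approximated in code. The sketch below is only an illustration and not part of the original thread: `has_extractable_text` and `page_texts` are hypothetical helper names, and `page_texts` assumes the same legacy PyPDF2 calls as the question (`PdfFileReader`, `getPage`, `extractText`), which newer PyPDF2 releases rename to `PdfReader`, `pages` and `extract_text`. If none of the sampled pages yields anything but whitespace, the PDF is probably made of scanned images or uses an encoding the library cannot decode.

```python
def has_extractable_text(texts):
    """Return True if any extracted page string contains non-whitespace text."""
    return any(text and text.strip() for text in texts)


def page_texts(path, page_numbers):
    """Extract text from the given zero-based page numbers (hypothetical helper).

    Assumes the legacy PyPDF2 API used in the question.
    """
    import PyPDF2  # imported lazily so has_extractable_text works without it

    with open(path, 'rb') as file:
        reader = PyPDF2.PdfFileReader(file)
        return [reader.getPage(n).extractText() for n in page_numbers]
```

For example, if `has_extractable_text(page_texts(r'C:/Users/pdfs/test_file.pdf', [0, 1, 445]))` returns `False` (the page numbers 0 and 1 are arbitrary samples), the file most likely needs OCR or a different extraction library.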

Alister Baroi · Accepted Answer · 2020-10-07 11:27:03Z

1

I had the same issue and thought something was wrong with my code. After some intense research and debugging, it seems that the PyPDF2, PyPDF3 and PyPDF4 packages can't handle large files. I tried a 20-page PDF and it ran seamlessly, but with a 50+ page PDF, PyPDF crashes.

My only suggestion would be to use a different package altogether. pdftotext is a good option; install it with pip install pdftotext.
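
A minimal sketch of what using pdftotext could look like, following the usage shown in the pdftotext README. The function names `read_pdf_pages` and `join_pages` are hypothetical, and the package also needs the poppler libraries installed on the system.

```python
def read_pdf_pages(path):
    """Return a list with the text of every page (hypothetical helper)."""
    import pdftotext  # third-party: pip install pdftotext

    with open(path, 'rb') as file:
        pdf = pdftotext.PDF(file)
        # A pdftotext.PDF object behaves like a sequence of page strings
        return list(pdf)


def join_pages(pages):
    """Join page strings into one document, separating pages with a blank line."""
    return "\n\n".join(pages)
```

With these helpers, `join_pages(read_pdf_pages(r'C:/Users/pdfs/test_file.pdf'))` would give the text of the whole document as one string, assuming the PDF actually contains a text layer.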

Vishakha Nagpal · Accepted Answer · 2021-07-21 18:17:49Z

I faced a similar issue while reading my PDF files; I hope the solution below helps. The reason in my case: the PDF I was selecting was actually a scanned image. I created my resume using a third-party site, which returned a PDF. When parsing this type of file, I was not able to extract text directly.

Below is the tested, working code:

import os

import pytesseract
from pdf2image import convert_from_path
from PIL import Image


def readPdfFile(filePath):
    # Part 1: render each PDF page to a temporary JPEG image at 500 DPI
    pages = convert_from_path(filePath, 500)
    filenames = []
    for page_number, page in enumerate(pages, start=1):
        filename = "page_" + str(page_number) + ".jpg"
        page.save(filename, 'JPEG')
        filenames.append(filename)

    # Part 2: run OCR on each image and collect the text of every page
    text = ""
    for filename in filenames:
        with Image.open(filename) as image:
            page_text = pytesseract.image_to_string(image)
        # Rejoin words that were hyphenated across a line break
        text += page_text.replace('-\n', '')

    # Part 3: remove the temporary image files
    for filename in filenames:
        os.remove(filename)
    return text
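
Because OCR is slow, it may be worth trying ordinary text extraction first and only falling back to OCR when nothing comes back. The sketch below is an assumption about how the two approaches in this thread could be combined, not something from the original answer; `extract_with_fallback` and its two callable parameters are hypothetical names.

```python
def extract_with_fallback(path, extract_text, ocr_text):
    """Try direct text extraction first; fall back to OCR if it yields nothing.

    extract_text and ocr_text are callables that take a file path and
    return a string (for example, a PyPDF2-based reader and readPdfFile).
    """
    text = extract_text(path)
    if text and text.strip():
        return text
    # Direct extraction returned only whitespace: the PDF is probably scanned
    return ocr_text(path)
```

Here `readPdfFile` from the code above could be passed as `ocr_text`, with any function that returns the PDF's embedded text as `extract_text`.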
