Convert pdf to text without creating a file

Question

I want to download PDF files from a website and work with their text, but I don't want to save each PDF to disk and then convert it to text. I use the Python requests library. Is there any way to get the text directly after the following code?

res = requests.get(url, timeout=None)

Possible duplicate of Extracting text from a PDF file using Python
– phd, Commented Nov 12, 2017 at 22:08
I'd say it isn't a duplicate of ^, because OP is asking "Can I do this...?" And the answer is no.
– cs95, Commented Nov 12, 2017 at 23:24

illusionx · Accepted Answer · 2018-02-16 11:08:50Z

As far as I know, you will have to create at least a temporary file to run the conversion.

The following code reads a PDF file and returns its text as a string; the driver script further down writes that string to a text file. It uses pdfminer (on Python 3, the pdfminer.six package) and was written for Python 3.7.

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert(case, fname, pages=None):
    # `case` is unused; it is kept so callers such as convert('text', path) still work
    pagenums = set(pages) if pages else set()  # an empty set means all pages
    manager = PDFResourceManager()
    output = io.StringIO()
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # Open in binary mode; the with block closes the file even if parsing fails
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    convertedPDF = output.getvalue()
    converter.close()
    output.close()
    return convertedPDF

Driver script that calls the function above (it assumes the code above is saved as converter.py):

import os
import converter  # the convert() function above, saved as converter.py


class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # default to the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through the files in the PDF directory
            print("File name is %s" % os.path.basename(pdf))
            file_extension = pdf.split(".")[-1]
            print("File extension is %s" % file_extension)
            if file_extension == "pdf":
                path = os.path.join(pdf_dir, pdf)  # build the path from pdf_dir, not a fixed folder
                print(path)
                text = converter.convert('text', path)  # text content of the PDF as a string
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                # The with block closes the text file after writing
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)


pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)

Of course you can tune it further, and there is probably room for improvement, but it works.

For your case, pass the path of a single temporary PDF file to convert() instead of looping over a folder of PDFs.
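A minimal sketch of that temp-file route, assuming the PDF bytes are already in memory (for example res.content from the question). The names with_temp_pdf and process are hypothetical helpers introduced here for illustration; they are not part of pdfminer or of the code above.

```python
import os
import tempfile


def with_temp_pdf(pdf_bytes, process):
    """Write pdf_bytes to a temporary .pdf file, call process(path), then delete the file."""
    # delete=False so the file can be reopened by name (needed on Windows)
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush and release the handle before another reader opens the path
        return process(tmp.name)
    finally:
        tmp.close()  # harmless if the file is already closed
        os.remove(tmp.name)  # the temporary file never outlives the call
```

With the conversion code saved as converter.py, a call could look like with_temp_pdf(res.content, lambda path: converter.convert('text', path)).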

Hope this helps. Happy coding!

thrinadhn · Accepted Answer · 2020-03-16 13:09:09Z

PyPDF2 works fine if all you want is the text.

Install the PyPDF2 package (https://pypi.org/project/PyPDF2/) from an Anaconda terminal or a command prompt:

pip install PyPDF2

The following code reads a PDF file and returns its text as a single string:

import PyPDF2


def getTextPDF(pdfFileName, password=''):
    # Uses the PyPDF2 1.x names; PyPDF2 3.0 removed them in favour of
    # PdfReader, len(reader.pages), reader.pages[i] and extract_text()
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(0, read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Concatenate the pages and strip every newline
    return ''.join(text).replace("\n", '')


getTextPDF('0001.pdf')

Works great for me
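For readers on a newer release, here is a hedged sketch of the same function with the renamed API. It assumes PyPDF2 2.x or later (or its successor pypdf, which exposes the same PdfReader name). The names get_text_pdf and pages_to_text are hypothetical, and unlike the original this version keeps a newline between pages. Because PdfReader accepts either a path or a binary file-like object, source can also be an io.BytesIO holding downloaded bytes.

```python
import io


def pages_to_text(pages):
    """Join the text of each page; an empty result for a page becomes an empty string."""
    return "\n".join((page.extract_text() or "") for page in pages)


def get_text_pdf(source, password=""):
    """Extract text from a path or a binary stream such as io.BytesIO(pdf_bytes)."""
    # Imported here so the helper above stays usable without the dependency.
    # Assumption: PyPDF2 >= 2.0; with pypdf use `from pypdf import PdfReader`.
    from PyPDF2 import PdfReader

    reader = PdfReader(source)
    if reader.is_encrypted:
        reader.decrypt(password)
    return pages_to_text(reader.pages)
```

A call such as get_text_pdf(io.BytesIO(res.content)) would then skip the file on disk entirely.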

edited Mar 16, 2020 at 13:09

answered Sep 18, 2018 at 12:39


Krooz · Accepted Answer · 2020-02-23 16:18:53Z

If your PDF file is stored in AWS S3 (Simple Storage Service), pass the unsigned URL:

import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO


def extract_PDF(url):  # URL where the PDF is stored online

    # Strip the site prefix from the URL to get the S3 object key
    CF = "https://<Bucket_name>.<Website>.com/"
    object_name = url.replace(CF, '')
    bucket_name = "<Bucket_name>.<Website>.com"

    # Read the object into memory; no file is written to disk
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, object_name)
    fs = obj.get()['Body'].read()
    pdfFile = PdfFileReader(BytesIO(fs))

    text = ""
    for page_no in range(len(pdfFile.pages)):
        page = pdfFile.getPage(page_no)
        text += page.extractText()
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text

answered Feb 23, 2020 at 16:18


1 Comment

jeffbyrnes Over a year ago

Probably more helpful to this question to drop anything regarding S3, which confuses what’s relevant, and rewrite this to request a regular URL, per the original question that uses the requests.get() method.
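Along the lines of this comment, a sketch of the same in-memory pattern for a regular URL fetched with requests.get(), as in the question. The function names, the 30-second timeout and the header check are assumptions made for illustration, and the PdfReader import assumes PyPDF2 2.x or later (or pypdf).

```python
import io


def response_to_pdf_stream(res):
    """Wrap the body of a requests.Response in an in-memory binary stream."""
    res.raise_for_status()  # fail early on HTTP errors
    data = res.content  # raw bytes of the response body
    # Rough sanity check: a PDF carries a "%PDF-" header near the start of the file
    if b"%PDF-" not in data[:1024]:
        raise ValueError("response body does not look like a PDF")
    return io.BytesIO(data)


def extract_text_from_url(url, timeout=30):
    """Download a PDF and return its text without writing a file to disk."""
    # Third-party imports are kept local so the helper above has no dependencies.
    import requests
    from PyPDF2 import PdfReader  # assumption: PyPDF2 >= 2.0, or pypdf's PdfReader

    res = requests.get(url, timeout=timeout)
    reader = PdfReader(response_to_pdf_stream(res))
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

If the response has already been fetched, response_to_pdf_stream(res) alone gives a stream that any parser accepting a binary file object can read.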
