0

I want to download PDF files from a website and work with their text, but I don't want to save each PDF to disk and then convert it to text. I use the Python requests library. Is there any way to get the text directly after the following code?

res = requests.get(url, timeout=None)

2
  • 1
    Possible duplicate of Extracting text from a PDF file using Python Commented Nov 12, 2017 at 22:08
  • 1
    I'd say it isn't a duplicate of ^, because OP is asking "Can I do this...?" And the answer is no. Commented Nov 12, 2017 at 23:24

3 Answers

4

As far as I know, you will have to at least create a temporary file to run the conversion on.

You can use the following code, which reads a PDF file and returns its text. It uses pdfminer and Python 3.7.

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert(case, fname, pages=None):
    # `case` is unused here; it is kept so existing callers keep working.
    # An empty set means no page filter, so every page is processed.
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    output = io.StringIO()
    converter = TextConverter(manager, output, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # The context manager closes the PDF even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=True, check_extractable=True):
            interpreter.process_page(page)

    convertedPDF = output.getvalue()
    print(convertedPDF)

    converter.close()
    output.close()
    return convertedPDF

Main script that calls the function above (it assumes the function is saved as converter.py):

import os
import converter  # the convert() function above, saved as converter.py

class ConvertMultiple:
    @staticmethod
    def convert_multiple(pdf_dir, txt_dir):
        if pdf_dir == "":
            pdf_dir = os.getcwd()  # default to the current directory
        for pdf in os.listdir(pdf_dir):  # iterate through the files in the PDF directory
            print("File name is %s" % pdf)
            file_extension = pdf.split(".")[-1]
            print("File extension is %s" % file_extension)
            if file_extension == "pdf":
                path = os.path.join(pdf_dir, pdf)
                print(path)
                text = converter.convert('text', path)  # text content of the PDF as a string
                text_file_name = os.path.join(txt_dir, pdf + ".txt")
                with open(text_file_name, "w", encoding="utf-8") as text_file:
                    text_file.write(text)  # write the text to the output file


pdf_dir = "E:/pdf"
txt_dir = "E:/text"
ConvertMultiple.convert_multiple(pdf_dir, txt_dir)

Of course, there is room to tune and improve this, but it works.

For your case, pass a single temporary PDF file to convert instead of a folder of PDFs.
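
One way to follow the temp-file advice with the question's requests call is sketched below. It assumes the response body (res.content) holds the PDF bytes. The helper names write_temp_pdf and convert_bytes_via_temp_file are made up for illustration, and convert_fn stands for any converter that takes a file path.

```python
import os
import tempfile


def write_temp_pdf(data: bytes) -> str:
    """Write downloaded bytes to a temporary .pdf file and return its path."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so it can be reopened by name afterwards.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        return tmp.name


def convert_bytes_via_temp_file(data: bytes, convert_fn) -> str:
    """Run a path-based converter on in-memory PDF bytes via a temp file."""
    path = write_temp_pdf(data)
    try:
        return convert_fn(path)
    finally:
        os.remove(path)  # clean up even if the conversion raises
```

With the convert function above, the call might look like text = convert_bytes_via_temp_file(res.content, lambda path: convert('text', path)).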

Hope this helps. Happy coding!

2

PyPDF2 works fine if all you want is the text.

Install the PyPDF2 package (https://pypi.org/project/PyPDF2/) from an Anaconda terminal or a command prompt:

pip install PyPDF2

You can use the following code, which reads a PDF file and returns its text as a single string:

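import PyPDF2

# Note: PdfFileReader, getNumPages, getPage and extractText are the names used by
# older PyPDF2 releases; PyPDF2 3.0 removed them in favor of PdfReader, pages and extract_text.
def getTextPDF(pdfFileName, password=''):
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Join the pages, then strip every newline so the result is one continuous line.
    return '\n'.join(text).replace("\n", '')


getTextPDF('0001.pdf')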

Works great for me

1

If your PDF file is in AWS S3 (Simple Storage Service), pass the unsigned URL.

import boto3
from PyPDF2 import PdfFileReader
from io import BytesIO


def extract_PDF(url):  # URL where the PDF is stored online

    CF = "https://<Bucket_name>.<Website>.com/"
    object_name = url.replace(CF, '')  # strip the prefix to get the S3 object key
    bucket_name = "<Bucket_name>.<Website>.com"

    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, object_name)
    fs = obj.get()['Body'].read()  # raw PDF bytes, held in memory
    pdfFile = PdfFileReader(BytesIO(fs))  # BytesIO makes the bytes look like a file

    text = ""
    for page_no in range(len(pdfFile.pages)):
        page = pdfFile.getPage(page_no)
        text += page.extractText()
    text = text.replace('\n', '')
    text = text.replace('  ', '')
    return text

1 Comment

Probably more helpful to this question to drop anything regarding S3, which confuses what’s relevant, and rewrite this to request a regular URL, per the original question that uses the requests.get() method.
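
A sketch of what that rewrite could look like, assuming a recent PyPDF2 release (where the reader class is PdfReader and pages expose extract_text) and a URL that returns the PDF bytes directly. The function names and the reader_cls parameter are illustrative and do not come from the answers above.

```python
from io import BytesIO


def extract_text_in_memory(data, reader_cls=None):
    """Extract text from PDF bytes without writing a file to disk."""
    if reader_cls is None:
        # Assumption: a recent PyPDF2 is installed. It is imported lazily so
        # another reader class can be passed in (reader_cls is a hypothetical hook).
        from PyPDF2 import PdfReader
        reader_cls = PdfReader
    reader = reader_cls(BytesIO(data))  # BytesIO wraps the bytes as a file-like object
    # extract_text() can come back empty for scanned or image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdf_text_from_url(url, timeout=30):
    """Download a PDF with requests and return its text."""
    import requests  # third-party; imported here so the helper above works without it

    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    return extract_text_in_memory(res.content)
```

Here res.content plays the role that the S3 object body plays in the answer: the bytes stay in memory and are handed to the reader through BytesIO.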
