
I am trying to download the publications on every page of https://occ.ca/our-publications

My end goal is to parse through the text in the PDF files and locate certain keywords.

Thus far, I have been able to scrape the links to the PDF files on all the pages, and I have saved these links into a list. Now I want to go through the list and download all the PDF files with Python. Once the files have been downloaded, I want to parse through them.

This is the code that I have used thus far:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.append(links)

Next, I want to go through that list and download the PDF files.

import urllib.request
for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))

This is the error I get when I run the code:

Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x, 'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
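For reference, urllib can attach the same User-Agent header that the scraping code already uses by wrapping each URL in a Request object; a 403 often means the server rejects requests without a browser-like User-Agent. A minimal sketch with placeholder URLs (the real list would come from the scraping step above):

```python
import urllib.request

# Placeholder links standing in for the scraped "publications" list.
publications = ["https://occ.ca/example1.pdf", "https://occ.ca/example2.pdf"]

for i, url in enumerate(publications):
    # Attach a browser-like User-Agent, as in the scraping code above.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla"})
    fname = "Publication_{}.pdf".format(i)
    # urllib.request.urlopen(req) would perform the actual download; omitted here.
    print(fname, req.get_header("User-agent"))
```

Note that enumerate gives one distinct filename per link, whereas 'Publication_{}'.format(range(213)) produces the same name for every file.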

  • Do you want a hierarchical list or just a flat list of links? If you want a flat list, then you shouldn't use append but extend, i.e. publications.extend(links) instead of publications.append(links). Commented Sep 26, 2019 at 17:23
  • I'm at a complete loss with the second code snippet. I thought publications is a list of PDFs? Shouldn't you just do for link in publications: rslt = requests.get(link), and then in the for loop use a tool that can parse the PDF and extract words? Commented Sep 26, 2019 at 17:26
  • publications is a list of pdf links. Commented Sep 26, 2019 at 17:27
  • Also, in the above code snippet you're using requests, which is in my opinion rather easy to use. In the second snippet you use urllib.request, which is in my opinion more annoying to use. I'd suggest sticking with requests for simpler code. Commented Sep 26, 2019 at 17:27
  • To me it looks as if it is a list of lists of PDF links, so it doesn't seem to be a flat list, at least if you really used the code that you posted. You declare an empty list named publications, then execute a for loop; in this for loop you create a list named links, and then you append this list to publications (thus a list of lists). If you used extend you would have a flat list. Commented Sep 26, 2019 at 17:28
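The append-vs-extend distinction raised in the comments can be seen in a tiny sketch (the link names are made up for illustration):

```python
# append nests each page's list inside the result; extend flattens it.
links_page1 = ["a.pdf", "b.pdf"]
links_page2 = ["c.pdf"]

nested = []
nested.append(links_page1)
nested.append(links_page2)
print(nested)  # [['a.pdf', 'b.pdf'], ['c.pdf']]

flat = []
flat.extend(links_page1)
flat.extend(links_page2)
print(flat)  # ['a.pdf', 'b.pdf', 'c.pdf']
```

Iterating over the nested version hands urlretrieve a list instead of a URL string, which is why a flat list is needed before downloading.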

2 Answers


Please try:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.extend(links)

for cntr, link in enumerate(publications):
    print("try to get link", link)
    rslt = requests.get(link, headers={'User-Agent': 'Mozilla'})
    print("Got", rslt)
    fname = "temporarypdf_%d.pdf" % cntr
    with open(fname, "wb") as fout:
        # rslt.raw.read() returns nothing unless stream=True was passed to get();
        # rslt.content holds the downloaded bytes.
        fout.write(rslt.content)
    print("saved pdf data into", fname)
    # Call here the code that reads and parses the pdf.
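For the parsing step the question asks about, a minimal sketch of keyword matching on extracted text: find_keywords is a hypothetical helper, and the pdfminer.six call is shown commented out since it requires the library to be installed and a real PDF on disk.

```python
def find_keywords(text, keywords):
    """Return the subset of keywords that occur in text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

# With pdfminer.six installed, the text of a saved file could be extracted via:
# from pdfminer.high_level import extract_text
# text = extract_text("temporarypdf_0.pdf")

# Illustrative stand-in for extracted PDF text:
sample = "The Ontario Chamber of Commerce report discusses infrastructure spending."
print(find_keywords(sample, ["infrastructure", "taxation"]))  # ['infrastructure']
```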

3 Comments

Thanks for your help gelonida. The code runs, but I'm not sure what it does, as it doesn't seem to have downloaded the PDF files.
This answer still does not go through the PDF text, but it should at least try to download each PDF. For going through the text of a PDF file you could use a library like pdfminer.six (pypi.org/project/pdfminer.six).
I enhanced the code to save the PDFs into temporary files for debugging / testing.

Could you please also tell us the line number where the error occurs?

5 Comments

I'm rather new to Stack Overflow, but shouldn't this be a comment rather than an answer?
I was not able to comment; that's why I wrote an answer. You need 50 reputation points to comment.
Ah, true, I didn't see your rep. I forgot about this restriction; I had the same issue only one or two weeks ago. However, I would mention in the answer that it is not an answer, but that you can't comment yet, to reduce confusion.
I think it is better if I just delete this answer; you solved the problem =)
In any case, I added your question to the comments section, so the more experienced folks can decide whether you should delete this answer or keep it, so that upvotes on it get you a few points closer to commenting rights.
