
I am trying to download the publications on every page of https://occ.ca/our-publications

My end goal is to parse through the text in the PDF files and locate certain keywords.

Thus far, I have been able to scrape the links to the PDF files on all the pages, and I have saved these links into a list. Now I want to go through the list and download all the PDF files with Python. Once the files have been downloaded, I want to parse through them.

This is the code that I have used thus far:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.append(links)

Next, I want to go through that list and download the PDF files.

import urllib.request
for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))

This is the error I get when I run the code:

Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x, 'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
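For reference, urllib can attach the same User-Agent header that the scraping code already uses by wrapping each URL in a Request object; a 403 often means the server rejects requests without a browser-like User-Agent. A minimal sketch with placeholder URLs (the real list would come from the scraping step above):

```python
import urllib.request

# Placeholder links standing in for the scraped "publications" list.
publications = ["https://occ.ca/example1.pdf", "https://occ.ca/example2.pdf"]

for i, url in enumerate(publications):
    # Attach a browser-like User-Agent, as in the scraping code above.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla"})
    fname = "Publication_{}.pdf".format(i)
    # urllib.request.urlopen(req) would perform the actual download; omitted here.
    print(fname, req.get_header("User-agent"))
```

Note that enumerate gives one distinct filename per link, whereas 'Publication_{}'.format(range(213)) produces the same name for every file.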

  • Do you want a hierarchical list or just a flat list of links? If you want a flat list, then you shouldn't use append but extend, i.e. publications.extend(links) instead of publications.append(links). Commented Sep 26, 2019 at 17:23
  • I'm at a complete loss with the second code snippet. I thought publications is a list of PDFs? Shouldn't you just do for link in publications: rslt = requests.get(link), and then in the for loop use a tool that can parse the PDF and extract words? Commented Sep 26, 2019 at 17:26
  • publications is a list of pdf links. Commented Sep 26, 2019 at 17:27
  • Also, in the above code snippet you're using requests, which is in my opinion rather easy to use. In the second snippet you use urllib.request, which is in my opinion more annoying to use. I'd suggest sticking with requests for simpler code. Commented Sep 26, 2019 at 17:27
  • To me it looks as if it is a list of lists of PDF links, so it doesn't seem to be a flat list, at least if you really used the code that you posted. You declare an empty list named publications, then execute a for loop; in this for loop you create a list named links, and then you append this list to publications (thus a list of lists). If you used extend you would have a flat list. Commented Sep 26, 2019 at 17:28
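The append-vs-extend distinction raised in the comments can be seen in a tiny sketch (the link names are made up for illustration):

```python
# append nests each page's list inside the result; extend flattens it.
links_page1 = ["a.pdf", "b.pdf"]
links_page2 = ["c.pdf"]

nested = []
nested.append(links_page1)
nested.append(links_page2)
print(nested)  # [['a.pdf', 'b.pdf'], ['c.pdf']]

flat = []
flat.extend(links_page1)
flat.extend(links_page2)
print(flat)  # ['a.pdf', 'b.pdf', 'c.pdf']
```

Iterating over the nested version hands urlretrieve a list instead of a URL string, which is why a flat list is needed before downloading.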

2 Answers


Please try:

import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".

publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    publications.extend(links)

for cntr, link in enumerate(publications):
    print("try to get link", link)
    rslt = requests.get(link, headers={'User-Agent': 'Mozilla'})
    print("Got", rslt)
    fname = "temporarypdf_%d.pdf" % cntr
    with open(fname, "wb") as fout:
        # rslt.raw.read() returns nothing unless stream=True was passed to get();
        # rslt.content holds the downloaded bytes.
        fout.write(rslt.content)
    print("saved pdf data into", fname)
    # Call here the code that reads and parses the pdf.
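For the parsing step the question asks about, a minimal sketch of keyword matching on extracted text: find_keywords is a hypothetical helper, and the pdfminer.six call is shown commented out since it requires the library to be installed and a real PDF on disk.

```python
def find_keywords(text, keywords):
    """Return the subset of keywords that occur in text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

# With pdfminer.six installed, the text of a saved file could be extracted via:
# from pdfminer.high_level import extract_text
# text = extract_text("temporarypdf_0.pdf")

# Illustrative stand-in for extracted PDF text:
sample = "The Ontario Chamber of Commerce report discusses infrastructure spending."
print(find_keywords(sample, ["infrastructure", "taxation"]))  # ['infrastructure']
```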

3 Comments

Thanks for your help gelonida. The code runs, but I'm not sure what it does, as it doesn't seem to have downloaded the PDF files.
This answer still does not go through the PDF text, but it should at least try to download each PDF. For going through the text of a PDF file you could use a library like pdfminer.six (pypi.org/project/pdfminer.six).
I enhanced the code to save the PDFs into temporary files for debugging / testing.

Could you please also tell us the line number where the error occurs?

5 Comments

I'm rather new to Stack Overflow, but shouldn't this be a comment rather than an answer?
I was not able to comment; that's why I wrote an answer. You need 50 reputation points to comment.
Ah, true, I didn't see your rep. I forgot about this restriction; I had the same issue only one or two weeks ago. However, I would mention in the answer that it is not an answer, but that you can't comment yet, to reduce confusion.
I think it is better if I just delete this answer; you solved the problem =)
In any case, I added your question to the comments section, so the more experienced folks can decide whether you should delete this answer or keep it, so that upvotes on it get you a few points closer to commenting rights.
