I am trying to download the publications on every page of https://occ.ca/our-publications
My end goal is to parse through the text in the PDF files and locate certain keywords.
Thus far, I have been able to scrape the links to the PDF files on all the pages and save them into a list. Now, I want to go through the list and download all the PDF files with Python. Once the files have been downloaded, I want to parse through them.
This is the code that I have used thus far:
import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".
publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        publications.append(links)
Next, I want to go through that list and download the PDF files.
import urllib.request

for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))
This is the error I get when I run the code:
Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x,'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
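The 403 likely happens because urllib.request sends no browser-like User-Agent header, while the scraping step does. One way around it is a sketch like the following, which downloads with requests (sending the same header) and derives each file name from the URL's last path segment; the folder name publications and the helper filename_from_url are my own inventions, not part of the original code:

```python
import os
from urllib.parse import urlsplit

# Same User-Agent header the scraping step already sends.
HEADERS = {'User-Agent': 'Mozilla'}

def filename_from_url(url):
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(urlsplit(url).path)
    return name or 'download.pdf'  # fallback when the URL ends in '/'

def download_all(urls, folder='publications'):
    """Download every URL in the list into the given folder."""
    import requests  # assumed installed, as in the scraping step
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        resp = requests.get(url, headers=HEADERS)
        if resp.status_code == 200:
            # PDFs are binary, so write resp.content in 'wb' mode.
            with open(os.path.join(folder, filename_from_url(url)), 'wb') as f:
                f.write(resp.content)
```

Deriving the name from the URL also fixes the naming bug above: 'Publication_{}'.format(range(213)) formats the range object itself, so every file would get the same name and overwrite the previous one.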
Should I use publications.extend(links) instead of publications.append(links), then loop over the list with for link in publications: rslt = requests.get(link), and then, inside that loop, use a tool that can parse the PDF and extract words?
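On the first point: publications.extend(links) keeps the list flat (one URL string per entry), whereas append nests one sub-list per page, which is why urlretrieve was handed a list instead of a URL. For the keyword step, a case-insensitive whole-word search could look like this sketch; the pypdf library and the example keywords are my own assumptions, not something the original post specifies:

```python
import re

def find_keywords(text, keywords):
    """Return the subset of keywords that occur in text (case-insensitive, whole words)."""
    found = []
    for kw in keywords:
        if re.search(r'\b' + re.escape(kw) + r'\b', text, re.IGNORECASE):
            found.append(kw)
    return found

def pdf_keywords(path, keywords):
    """Extract the text of every page of a PDF and search it for the keywords."""
    from pypdf import PdfReader  # assumption: pip install pypdf
    text = '\n'.join(page.extract_text() or '' for page in PdfReader(path).pages)
    return find_keywords(text, keywords)
```

pdf_keywords('publications/report.pdf', ['climate', 'tax']) would then return whichever of those words appear in that file.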