
I am web-scraping a lot of PDFs of committee meetings off a local government website (https://www.gmcameetings.co.uk/), so there are links within links within links. I can successfully scrape all the 'a' tags from the main area of the page (the ones that I want), but when I try to scrape anything within them I get the error in the title of the question: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? How do I fix this?

I am completely new to coding and started an internship yesterday for which I am expected to web-scrape this information. The woman I'm supposed to be working with is not here for another couple of days and nobody else can help me, so please bear with me and be kind; I am a complete beginner doing this alone. I know I have set up the first part of the code correctly, as I can download the whole page or download any particular link. Again, it's when I try to scrape within the links I have already (and successfully) scraped that I get the above error message. I think (with the little knowledge I have) that it's because of the output of all_links, which comes out as below. I have tried both find() and findAll(), which both result in the same error message.

 #the error message
 date_links_area = all_links.find('ul', {"class": "item-list item-list--rich"})
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Users\rache\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 1620, in __getattr__
     "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
 AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

#output of all_links looks like this (this is only part of it)

 [<a href="https://www.gmcameetings.co.uk/info/20180/live_meetings/199/membership_201819">Members of the GMCA 2018/19</a>, <a href="...">Greater Manchester Combined Authority Constitution</a>, <a href="...">Meeting papers</a>, ...]

Some of those links then go to a page that has a list of dates, which is the area of the page I'm trying to get to. Within that area I need to get the links with the dates, and within those I need to grab the PDFs I want. Apologies if this doesn't make sense; I'm trying my best to do this on my own with zero experience.
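For reference, a minimal sketch of the setup being described; the selector and variable names here are assumptions, since the question doesn't show how all_links was actually created:

 import requests
 from bs4 import BeautifulSoup

 # Fetch and parse the main page (assumed setup; the question's own code,
 # including how the 'main area' is selected, is not shown).
 page = requests.get('https://www.gmcameetings.co.uk/')
 page_soup = BeautifulSoup(page.text, 'html.parser')

 # find_all() returns a ResultSet - a list of tags, not a single tag.
 all_links = page_soup.find_all('a')

 # Calling all_links.find(...) raises the AttributeError in the question.
 # Instead, call find() on (or read attributes of) each tag individually:
 for link in all_links:
     print(link.get('href'))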

4 Comments
  • Is this one page you are trying to scrape? It is accessible from the "Members' register of interests" link on the main page. Are you just trying to get the PDFs and dates from that page, or are there others? For instance, this has a list of links with dates that point to another PDF listing. Are you trying to get those links too? Commented Jul 9, 2019 at 14:47
  • It's all of the links on the main page (gmcameetings.co.uk) for every committee. I set a 'main area' of the page which contains all the links I want (for example, I don't want the contact links at the bottom etc.), and therefore the 'all_links' are the ones I want. So basically scraping from multiple pages. Yes, exactly: I want all of those links that point to PDF listings. Commented Jul 9, 2019 at 14:53
  • Can we assume only links within the Meeting papers? Commented Jul 9, 2019 at 15:30
  • Does this answer your question? Beautiful Soup: 'ResultSet' object has no attribute 'find_all'? Commented Mar 22, 2020 at 22:36

2 Answers


This solution uses recursion to keep scraping the links on each page until the PDF URLs are discovered, with a visited set so the same page isn't scraped twice:

from bs4 import BeautifulSoup as soup
import requests

def scrape(url, seen=None):
    # Track visited pages so circular links can't cause infinite recursion.
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)
    try:
        # Search only the main content area of the page.
        main = soup(requests.get(url).text, 'html.parser').find('main', {'id': 'content'})
        for i in main.find_all('a'):
            if '/downloads/meeting/' in i['href'] or '/downloads/file/' in i['href']:
                # Direct link to a meeting paper, so yield it.
                yield i
            elif i['href'].startswith('https://www.gmcameetings.co.uk'):
                # Internal page: recurse into it and yield whatever it finds.
                yield from scrape(i['href'], seen)
    except (requests.RequestException, AttributeError, KeyError):
        # Skip pages that fail to load, lack a main content area,
        # or contain <a> tags without an href.
        pass

urls = list(scrape('https://www.gmcameetings.co.uk/'))
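As a possible follow-up (not part of the original answer), the yielded tags could be used to download the files. The idea that the last path segment makes a usable filename is a guess about this particular site:

 import os
 import requests

 os.makedirs('pdfs', exist_ok=True)
 for tag in urls:
     href = tag['href']
     # Guess a filename from the last path segment (site-specific assumption).
     name = href.rstrip('/').split('/')[-1]
     with open(os.path.join('pdfs', name + '.pdf'), 'wb') as f:
         f.write(requests.get(href).content)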

1 Comment

Thank you, you're a lifesaver. I've set it running and it's returning results. I'm going to work through it and figure out what all the code means so I can understand it better, but at least for now I can go on to the next step now that I have these. Thank you again!

The error is actually telling you what the problem is. all_links is a list (ResultSet object) of HTML elements you found. You need to iterate the list and call find on each one:

 sub_links = [link.find('ul', {"class": "item-list item-list--rich"}) for link in all_links]

