
I am web-scraping a lot of PDFs of committee meetings off a local government website (https://www.gmcameetings.co.uk/), so there are links within links within links. I can successfully scrape all the 'a' tags from the main area of the page (the ones that I want), but when I try to scrape anything within them I get the error in the title of the question: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? How do I fix this?

I am completely new to coding and started an internship yesterday for which I am expected to web-scrape this information. The woman I'm supposed to be working with is not here for another couple of days and nobody else can help me, so please bear with me and be kind; I am a complete beginner doing this alone. I know I have set up the first part of the code correctly, as I can download the whole page or download any particular link. Again, it's when I try to scrape within the links I have already (and successfully) scraped that I get the above error message. I think (with the little knowledge I have) that it's because of the output of all_links, which comes out as below. I have tried both find() and findAll(), which both result in the same error message.

 #the error message
 date_links_area = all_links.find('ul', {"class": "item-list item-list--rich"})
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "C:\Users\rache\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 1620, in __getattr__
     "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
 AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

#output of all_links looks like this (this is only part of it)

 [<a href="https://www.gmcameetings.co.uk/info/20180/live_meetings/199/membership_201819">Members of the GMCA 2018/19</a>, <a href="...">Greater Manchester Combined Authority Constitution</a>, <a href="...">Meeting papers</a>, ...]

Some of those links then go to a page that has a list of dates, which is the area of the page I'm trying to get to. Within that area I need to get the links with the dates, and within those I need to grab the PDFs I want. Apologies if this doesn't make sense; I'm trying my best to do this on my own with zero experience.
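For reference, a minimal sketch of the setup being described; the selector and variable names here are assumptions, since the question doesn't show how all_links was actually created:

 import requests
 from bs4 import BeautifulSoup

 # Fetch and parse the main page (assumed setup; the question's own code,
 # including how the 'main area' is selected, is not shown).
 page = requests.get('https://www.gmcameetings.co.uk/')
 page_soup = BeautifulSoup(page.text, 'html.parser')

 # find_all() returns a ResultSet - a list of tags, not a single tag.
 all_links = page_soup.find_all('a')

 # Calling all_links.find(...) raises the AttributeError in the question.
 # Instead, call find() on (or read attributes of) each tag individually:
 for link in all_links:
     print(link.get('href'))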

4 Comments
  • Is this one page you are trying to scrape? It is accessible from the "Members' register of interests" link on the main page. Are you just trying to get the PDFs and dates from that page, or are there others? For instance, this has a list of links with dates that point to another PDF listing. Are you trying to get those links too? Commented Jul 9, 2019 at 14:47
  • It's all of the links on the main page (gmcameetings.co.uk) for every committee. I set a 'main area' of the page which contains all the links I want (for example, I don't want the contact links at the bottom etc.), and therefore the 'all_links' are the ones I want. So basically scraping from multiple pages. Yes, exactly: I want all of those links that point to PDF listings. Commented Jul 9, 2019 at 14:53
  • Can we assume only links within the Meeting papers? Commented Jul 9, 2019 at 15:30
  • Does this answer your question? Beautiful Soup: 'ResultSet' object has no attribute 'find_all'? Commented Mar 22, 2020 at 22:36

2 Answers


This solution uses recursion to keep scraping the links on each page until the PDF URLs are discovered, with a visited set so the same page isn't scraped twice:

from bs4 import BeautifulSoup as soup
import requests

def scrape(url, seen=None):
    # Track visited pages so circular links can't cause infinite recursion.
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)
    try:
        # Search only the main content area of the page.
        main = soup(requests.get(url).text, 'html.parser').find('main', {'id': 'content'})
        for i in main.find_all('a'):
            if '/downloads/meeting/' in i['href'] or '/downloads/file/' in i['href']:
                # Direct link to a meeting paper, so yield it.
                yield i
            elif i['href'].startswith('https://www.gmcameetings.co.uk'):
                # Internal page: recurse into it and yield whatever it finds.
                yield from scrape(i['href'], seen)
    except (requests.RequestException, AttributeError, KeyError):
        # Skip pages that fail to load, lack a main content area,
        # or contain <a> tags without an href.
        pass

urls = list(scrape('https://www.gmcameetings.co.uk/'))
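As a possible follow-up (not part of the original answer), the yielded tags could be used to download the files. The idea that the last path segment makes a usable filename is a guess about this particular site:

 import os
 import requests

 os.makedirs('pdfs', exist_ok=True)
 for tag in urls:
     href = tag['href']
     # Guess a filename from the last path segment (site-specific assumption).
     name = href.rstrip('/').split('/')[-1]
     with open(os.path.join('pdfs', name + '.pdf'), 'wb') as f:
         f.write(requests.get(href).content)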

1 Comment

Thank you, you're a lifesaver. I've set it running and it's returning results. I'm going to work through it and figure out what all the code means so I can understand it better, but at least for now I can go on to the next step now that I have these. Thank you again!

The error is actually telling you what the problem is. all_links is a list (ResultSet object) of HTML elements you found. You need to iterate the list and call find on each one:

 sub_links = [link.find('ul', {"class": "item-list item-list--rich"}) for link in all_links]

