4

I want to scrape a specific part of Kickstarter.com.

I need the text of every project title. The site is consistently structured, and every project has this line:

<div class="Project-title">

My code looks like:

# Load libraries
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")

#Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print(title)

If I use soup.find_all, or use an index other than zero in project_title[0], Python raises an error.

I need a list of all the project titles on this page, e.g.:

  • The Superbook: Turn your smartphone into a laptop for $99
  • Weighitz: Weigh Smarter
  • Mine Kafon Drone
  • World's First And Only Complete Weather Camera System
  • Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux
  • Looking at BeautifulSoup's find function, you'll see that it only returns the first element. Commented Jul 25, 2016 at 10:48
  • @Sebastian Fischer, if you have a new question, then ask a new question; don't edit code from an answer into your original question. Commented Jul 25, 2016 at 13:38

3 Answers

2

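find() only returns the first matching element. To get all of them, you must use findAll (spelled find_all in current BeautifulSoup versions; findAll is the older alias).

Here's the code you need: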

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

We look at all the elements of tag h6 and class project-title. We then take the title from each of these elements, and create a list with it.
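
As a minimal sketch of the difference, the snippet below runs both calls on a small inline HTML string. The markup and project names are made up for illustration and only mimic the h6/project-title structure described in the question; no request is sent to Kickstarter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described in the question.
sample_html = """
<h6 class="project-title"><a href="/p/1">First project</a></h6>
<h6 class="project-title"><a href="/p/2">Second project</a></h6>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns a single Tag (the first match), or None if nothing matches.
first = soup.find("h6", {"class": "project-title"})
first_title = first.a.text

# find_all() returns a list-like result containing every match.
all_titles = [h6.a.text for h6 in soup.find_all("h6", {"class": "project-title"})]

print(first_title)
print(all_titles)
```

Because find() gives back one Tag rather than a list, indexing past its first child anchor is what fails in the original code.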

Hope this helps, and don't hesitate to ask if you have any questions.

Edit: the problem with the code above is that it fails if any element in the list returned by findAll has no child a tag.

How to prevent this:

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

This adds a title to the list only when project.findChildren('a') contains at least one element (an empty list [] evaluates to False).

Edit: to get the descriptions of the elements (class project-blurb), let's look at the HTML code.
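
A small sketch of the failure and the guard, using made-up markup in which one h6 has no link. It uses find("a"), which returns None when there is no match, as an equivalent alternative to checking the list from findChildren('a'); the sample HTML is an assumption, not Kickstarter's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second h6 deliberately has no <a> child.
sample_html = """
<h6 class="project-title"><a href="/p/1">Has a link</a></h6>
<h6 class="project-title">No link here</h6>
<h6 class="project-title"><a href="/p/3">Also has a link</a></h6>
"""

soup = BeautifulSoup(sample_html, "html.parser")
project_elements = soup.find_all("h6", {"class": "project-title"})

# Unguarded indexing raises IndexError on the h6 without a link.
try:
    unguarded = [p.find_all("a")[0].text for p in project_elements]
    raised_index_error = False
except IndexError:
    raised_index_error = True

# Guarded version: skip elements where no <a> child was found.
safe_titles = []
for project in project_elements:
    link = project.find("a")  # None when the element has no <a>
    if link is not None:
        safe_titles.append(link.text)

print(raised_index_error)
print(safe_titles)
```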

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

This is simply a paragraph with the class project-blurb. To get all of them, we could use the same approach as for project_elements, or a more condensed form:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
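
One detail worth checking: in the HTML above, the blurb text sits on its own line, so .text keeps the surrounding newlines. The sketch below, run on that same paragraph as an inline string, compares .text with get_text(strip=True), which trims the whitespace. Whether the live page has the same whitespace is an assumption.

```python
from bs4 import BeautifulSoup

# The paragraph from the answer, used as an inline sample.
sample_html = """
<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")
blurbs = soup.find_all("p", {"class": "project-blurb"})

# .text keeps the leading and trailing newlines inside the <p>.
raw_desc = [p.text for p in blurbs]

# get_text(strip=True) removes the surrounding whitespace.
clean_desc = [p.get_text(strip=True) for p in blurbs]

print(repr(raw_desc[0]))
print(clean_desc)
```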

8 Comments

Hey HolyDana. Thank you so much!!!!! But I get an error: "IndexError: list index out of range". Do you know why?
@SebastianFischer this error comes from project.findChildren('a')[0]: it fails to find at least a child for one of the elements. I'll edit to add an alternative way to do it, while preventing this error.
Oh HolyDonna.. Thank you. But it won't work. I only get the result "[]" when I print project_titles
@SebastianFischer I only realised I forgot to use findAll instead of find .... The code should be correct now.
Hey @HolyDanna.... Thank you. The code works. Now I get a comma-separated list with the correct strings. I want to adapt your code to the class "project-blurb" to get the description of the project. I pasted the code into my question at the top.... Thank you
1

With respect to the title of this post, I would recommend two tutorials on scraping particular data from a website. Both explain in detail how the task is achieved.

First, I would recommend checking out the pyimagesearch tutorial "Scraping images using Scrapy".

Then, if you need something more specific, a tutorial on web scraping will help you.

0

All the data you want is in the section with the CSS class staff-picks. Just find the h6 elements with the project-title class and extract the text from the anchor tag inside each one:

soup = BeautifulSoup(thepage, "html.parser")

print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Output:

['The Superbook: Turn your smartphone into a laptop for $99', 'Weighitz: Weigh Smarter', 'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', "Bagel: The World's Smartest Tape Measure", 'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', 'ISOLATE® - Switch off your ears!']

Or using find with find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])

There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one title per project, whichever approach you take.
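
To see that the two approaches agree, here is a sketch on a small inline HTML string. The section names and project titles are invented for illustration; only the staff-picks and project-title class names come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first section has the staff-picks class.
sample_html = """
<section class="staff-picks">
  <h6 class="project-title"><a href="/p/1">Picked project A</a></h6>
  <h6 class="project-title"><a href="/p/2">Picked project B</a></h6>
</section>
<section class="other">
  <h6 class="project-title"><a href="/p/3">Unrelated project</a></h6>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: anchors inside h6.project-title inside section.staff-picks.
via_select = [a.text for a in soup.select("section.staff-picks h6.project-title a")]

# Same result with find() for the section, then find_all() within it.
section = soup.find("section", class_="staff-picks")
via_find = [h6.a.text for h6 in section.find_all("h6", "project-title")]

print(via_select)
print(via_find)
```

Both lists leave out the h6 in the other section, which is the point of scoping the search to staff-picks first.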
