I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.

Here is my code:

#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")


#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)

I get [None, None, ..., None, None] back. I need a list of all the hrefs from the project-title class.

Any ideas?

1 Answer

Try something like this:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

soup = BeautifulSoup(thepage, "html.parser")

project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)

This will return every href on the page. As I can see from your link, a lot of the href attributes contain just #. You can filter these out with a simple regex that matches proper links, or just skip the # symbols.

project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]

This will still give you some junk links like /discover?ref=nav, so if you want to narrow it down, use a proper regex for the links you need.
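
For instance, here is a minimal sketch of such a filter, assuming (unverified) that real project links contain /projects/ somewhere in their path; check the actual hrefs on the page before relying on that pattern:

import re
import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
soup = BeautifulSoup(urllib.request.urlopen(theurl), "html.parser")

# Assumption: project pages have "/projects/" in the href; everything else is dropped.
project_href = [a['href'] for a in soup.find_all('a', href=True)
                if re.search(r"/projects/", a['href'])]
print(project_href)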

EDIT:

To solve the problem you mentioned in the comments:

soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])
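
As a side note, the question's comprehension returns None because BeautifulSoup reads tag attributes with subscript notation (tag['href']); dot notation like tag.href looks up a child tag named href, which does not exist here, so it evaluates to None. A minimal sketch of the original approach with only that change, reusing the soup object from above:

project_ref = soup.findAll('h6', {'class': 'project-title'})
# ['href'] is a dictionary-style attribute lookup; .href would search for a child <href> tag
project_href = [project.findChildren('a')[0]['href']
                for project in project_ref if project.findChildren('a')]
print(project_href)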

5 Comments

Oh yes, that works, thanks... Is it possible to get only the hrefs from the class <div class="project-card-content">?
Sure, I will edit my post as soon as I get to work.
Please update the code. Thank you for that...
Thank you. Now I get a list with the correct hrefs. That's nice. Do you know what I have to code to get a single list? I mean a result like this: ['href1', 'href2', 'href3', ..., 'href10'], because my other data looks like this and I want to export the data to a CSV and split it into separate rows. Thank you so much.
The code I presented gets the links line by line. You can use [i.a['href'] for i in soup.find_all('div', attrs={'class' : 'project-card-content'})] to get them back as a list.
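
Following up on the CSV question in the comments, here is a minimal sketch that writes the collected hrefs one per row, assuming a hypothetical output file named project_links.csv and the soup object from the answer above:

import csv

# Collect the hrefs as a list, as suggested in the comment above
project_href = [i.a['href'] for i in soup.find_all('div', attrs={'class': 'project-card-content'})]

# Write one href per row to the (hypothetical) output file
with open('project_links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for href in project_href:
        writer.writerow([href])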
