I am trying to learn the basics of web scraping in Python using Beautiful Soup. I came across some code in a document, but when I run it I get an error. The code is:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

for row in soup('table', {'class': 'mod-data’})[0].tbody('tr'):
  tds = row('td')
  print tds[0].string, tds[1].string

and the error is:

SyntaxError: Non-ASCII character '\xe2' in file ex.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Please help me solve this, and explain the line

for row in soup('table', {'class': 'mod-data’})[0].tbody('tr'):

Most sites give sample code without explaining how it was put together or what it means. Terms like class and tbody are a bit confusing. It would be really helpful if you could suggest a site, an ebook, or any other resource.

Did you read the PEP pointed to in the error message? Commented Feb 22, 2014 at 16:07

3 Answers

You have a typo in this line:

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

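Instead of a straight single quote (') after .org, you have a curly closing quote (’, U+2019). Its UTF-8 encoding begins with the byte \xe2, which is the character the error message reports.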

It should be something like:

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())
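One way to locate stray characters like this is to scan the source text for anything outside ASCII. The following is a minimal sketch written for Python 3 (the code in the question is Python 2); find_non_ascii and the sample string are hypothetical, made up for illustration.

```python
def find_non_ascii(source):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical two-line source with a curly quote (U+2019) closing the URL.
sample = "import urllib2\nurl = 'http://www.bcsfootball.org\u2019\n"

hits = find_non_ascii(sample)
for line_no, col, ch in hits:
    print(line_no, col, repr(ch))
```

To check a real file, the same function could be given the text returned by reading that file with an explicit encoding such as UTF-8.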

Also:

The same issue occurs in the for line: after mod-data, change the curly quote to a straight quote.

Once the quotes are fixed, soup('table', {'class': 'mod-data'})[0].tbody('tr') is not a syntax error, because calling a soup or tag object directly is shorthand for find_all. The explicit form is easier to read, though:

soup.find_all('table', {'class': 'mod-data'})[0].tbody('tr')

Use .findAll with BeautifulSoup 3, which has no find_all. Both return a list of the matching tags.

Read the BeautifulSoup docs and get the latest version (4) of BeautifulSoup.

The following code works for me:

import urllib2
from bs4 import BeautifulSoup # latest version bs4

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())

for row in soup.find_all("table", {"class": "mod-data"})[0].tbody("tr"):
    tds = row("td")
    print tds[0].string, tds[1].string

Output:

1 Florida State
2 Auburn
3 Alabama
4 Michigan State
5 Stanford
6 Baylor
7 Ohio State
8 Missouri
9 South Carolina
10 Oregon
11 Oklahoma
12 Clemson
13 Oklahoma State
14 Arizona State
15 UCF
16 LSU
17 UCLA
18 Louisville
19 Wisconsin
20 Fresno State
21 Texas A&M;
22 Georgia
23 Northern Illinois
24 Duke
25 USC

If you are having problems using single-quotes on those lines, use double-quotes.

6 Comments

This now works for me. Please let me know if it works for you.
I got an error again: 'NoneType' object is not callable.
Can you post your full error/traceback at the bottom of your post, please? It's working fine for me.
Traceback (most recent call last): File "stack.py", line 6, in <module> for row in soup.find_all('table', {'class': 'mod-data'})[0].tbody('tr'): TypeError: 'NoneType' object is not callable
Have you checked your indentation?

Try changing your fourth line from:

soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org’).read())

To:

soup = BeautifulSoup(urllib2.urlopen("http://www.bcsfootball.org").read())

It looks like your second single quote was different from the first, so changing to double quotes should alleviate that error.

The code you are asking about reads from a table. In HTML, each row of a table is denoted by the <tr> tag, which your program searches for and then reads from. It then prints the first and second columns of each row in the table it found.
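To see what each part of that line does, it can help to run it on a small inline document. This is a sketch assuming Python 3 and BeautifulSoup 4 (the bs4 package); the HTML string is a made-up stand-in for the real page, whose markup may differ.

```python
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical stand-in for the page: one table with class "mod-data".
html = """
<table class="mod-data">
  <tbody>
    <tr><td>1</td><td>Florida State</td></tr>
    <tr><td>2</td><td>Auburn</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <table> whose class attribute is "mod-data"; [0] takes the first match.
tables = soup.find_all("table", {"class": "mod-data"})
first_table = tables[0]

# .tbody is the first <tbody> inside the table, or None if there is none,
# so fall back to the table itself in that case.
body = first_table.tbody if first_table.tbody is not None else first_table

rows = []
for row in body.find_all("tr"):   # each <tr> is one table row
    tds = row.find_all("td")      # each <td> is one cell in that row
    rows.append((tds[0].string, tds[1].string))

print(rows)
```

If a table has no <tbody>, .tbody is None, and calling it as in .tbody('tr') raises TypeError: 'NoneType' object is not callable. That is why the sketch checks for None before looking up the rows.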

Try changing your second line to:

from bs4 import BeautifulSoup
