Screen Scraping in Python

Question

I am trying to screen scrape a website and put the info into a dictionary. I am using urllib2 and BeautifulSoup. I cannot figure out how to parse the page's source to get what I want and read it into a dictionary. The info I want is displayed as <title>Nov 24 | 8:00AM | Sole In. Peace Out. </title> in the source code. I am thinking of using a regular expression to read in the line, convert the time and date to a datetime, and then parse the line to read the data into a dictionary. The dictionary output should be something along the lines of

[ { "date": datetime(2010, 11, 24, 23, 59), "title": "Sole In. Peace Out.", } ]

Current Code:

from BeautifulSoup import BeautifulSoup
import re
import urllib2
url = 'http://events.cmich.edu/RssStudentEvents.aspx'
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

Sorry for the wall of text, and thank you for your time and help!

"wall of text"? My answer's more "wally" ;-)

– Chris Morgan, Commented Nov 24, 2010 at 4:43
Have you looked at the mechanize module?

– robert, Commented Nov 24, 2010 at 8:44

dpn · Accepted Answer · 2010-11-24 06:13:30Z

1

Something like this:

import datetime

# Assuming it's the second title element.. I've seen worse in html
titletext = soup.findAll('title')[1].string
parts = titletext.split("|")
# Strip the padding around each field; strptime rejects trailing whitespace.
datetext = parts[0].strip()
title = parts[2].strip()
date = datetime.datetime.strptime(datetext, "%b %d").replace(year=2010)
the_final_dict = {'date': date, 'title': title}

findAll() returns a list of every matching element, so you can index or slice it like any other list.

That should just about do it :)

Edit: small fix

Edit2: fix from comments below

edited Nov 24, 2010 at 6:13

answered Nov 24, 2010 at 4:03


1 Comment

amazinghorse24 Over a year ago

The first 'title' element is actually one I want to skip, so how do I go about doing that?

icyrock.com · Accepted Answer · 2010-11-24 04:29:48Z

0

EDIT: I did not realize it's not an HTML page, so take a look at the correction by Chris. The below would work for HTML pages.

You can use:

titleTag = soup.html.head.title

or:

soup.findAll('title')

Take a look here:

http://www.crummy.com/software/BeautifulSoup/documentation.html

edited Nov 24, 2010 at 4:29

answered Nov 24, 2010 at 3:59


1 Comment

Chris Morgan Over a year ago

It's not HTML. It's RSS. Thus soup.html.head.title won't work, and soup.findAll('title') is sub-optimal. Did you look at the page he gave?
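
Since the feed is RSS (XML), one alternative that none of the answers here uses is the standard library's xml.etree.ElementTree. The sketch below is only an illustration: SAMPLE_RSS is a hypothetical, trimmed-down stand-in for the real feed (the channel title "Student Events" is made up), and it assumes each event sits in an <item> element with its own <title>. Iterating over the items also sidesteps the feed-level <title> that would otherwise come first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed has more fields per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Student Events</title>
    <item><title>Nov 24 | 8:00AM | Sole In. Peace Out. </title></item>
    <item><title>Dec 8 | 8:00AM | Apply to be an FYE Mentor</title></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# iter("item") visits only <item> elements, so the channel's own
# <title> is never looked at.
item_titles = [item.findtext("title").strip() for item in root.iter("item")]

print(item_titles)
```

With a live feed you would pass the downloaded bytes to ET.fromstring instead of the sample string.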

Chris Morgan · Accepted Answer · 2010-11-24 04:39:07Z

>>> soup.findAll('item')[1].title
<title>Nov 24 | 8:00AM | Sole In. Peace Out. </title>
>>> soup.findAll('item')[1].title.text
u'Nov 24 | 8:00AM | Sole In. Peace Out.'
>>> date, _, title = soup.findAll('item')[1].title.text.rpartition(' | ')
>>> date
u'Nov 24 | 8:00AM'
>>> title
u'Sole In. Peace Out.'
>>> from datetime import datetime
>>> date = datetime.strptime(date, "%b %d | %I:%M%p").replace(year=datetime.now().year)
>>> dict(date=date, title=title)
{'date': datetime.datetime(2010, 11, 24, 8, 0), 'title': u'Sole In. Peace Out.'}

Note that this also includes the time of day.

And then, as I think you want all the items,

>>> from datetime import datetime
>>> matches = []
>>> for item in soup.findAll('item'):
...     date, _, title = item.title.text.rpartition(' | ')
...     matches.append(dict(date=datetime.strptime(date, '%b %d | %I:%M%p').replace(year=datetime.now().year), title=title))
... 
>>> from pprint import pprint
>>> pprint(matches)
[{'date': datetime.datetime(2010, 11, 24, 8, 0),
  'title': u'The Americana Indian\u2014American Indian in the American Imagination'},
 {'date': datetime.datetime(2010, 11, 24, 8, 0),
  'title': u'Sole In. Peace Out.'},
...
 {'date': datetime.datetime(2010, 12, 8, 8, 0),
  'title': u'Apply to be an FYE Mentor'}]

If you need more careful year handling (for example, for events that roll over into the next year), you can add it at the point where the year is replaced. You get the idea.
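
As one possible take on "more complex year handling", here is a minimal sketch. It rests on an assumption the feed does not confirm: that it lists current and upcoming events, so a date well in the past really belongs to next year. The function name parse_event_date and the today and grace_days parameters are hypothetical, introduced only for this illustration. The year is appended to the text before parsing rather than patched in afterwards with replace().

```python
from datetime import datetime, timedelta

FORMAT = "%b %d | %I:%M%p %Y"


def parse_event_date(text, today=None, grace_days=30):
    """Parse text like 'Nov 24 | 8:00AM' and guess the missing year.

    Assumption: a date more than grace_days before `today` is treated
    as belonging to the following year.
    """
    today = today or datetime.now()
    text = text.strip()
    parsed = datetime.strptime("%s %d" % (text, today.year), FORMAT)
    if parsed < today - timedelta(days=grace_days):
        # Too far in the past for an upcoming-events feed: roll over.
        parsed = datetime.strptime("%s %d" % (text, today.year + 1), FORMAT)
    return parsed


# Fixed "today" so the example does not depend on when it is run.
reference = datetime(2010, 12, 20)
print(parse_event_date("Dec 8 | 8:00AM", today=reference))
print(parse_event_date("Jan 5 | 9:30AM", today=reference))
```

Note that "Feb 29" raises ValueError when the guessed year is not a leap year; whether to catch that depends on what the feed can contain.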

Final addition: a generator would be a nice way of using this.

from datetime import datetime
import urllib2
from BeautifulSoup import BeautifulSoup

def whatevers():
    soup = BeautifulSoup(urllib2.urlopen('http://events.cmich.edu/RssStudentEvents.aspx').read())
    for item in soup.findAll('item'):
        date, _, title = item.title.text.rpartition(' | ')
        yield dict(date=datetime.strptime(date, '%b %d | %I:%M%p').replace(year=datetime.now().year), title=title)

for match in whatevers():
    pass  # Use match['date'], match['title']. A namedtuple might also be neat here.
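
To illustrate the namedtuple idea, here is a small sketch. The names Event and parse_title are hypothetical, and the function takes the title string and the year directly so it can be tried without fetching the feed. Inside the generator you would yield parse_title(item.title.text, year) instead of a dict.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type: fields are read as event.date and
# event.title instead of match['date'] and match['title'].
Event = namedtuple("Event", ["date", "title"])


def parse_title(text, year):
    """Split 'Nov 24 | 8:00AM | Some title' into an Event."""
    # rpartition splits on the last ' | ', leaving 'Nov 24 | 8:00AM'
    # on the left and the title on the right.
    date_text, _, title = text.strip().rpartition(" | ")
    date = datetime.strptime("%s %d" % (date_text, year), "%b %d | %I:%M%p %Y")
    return Event(date=date, title=title)


event = parse_title("Nov 24 | 8:00AM | Sole In. Peace Out. ", 2010)
print(event.date, event.title)
```

A namedtuple is immutable and unpacks like a plain tuple, so `date, title = event` also works.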
