Scraping the web in Python

Question

I'm completely new to web scraping, but I really want to learn it in Python. I have a basic understanding of Python.

I'm having trouble understanding some code that scrapes a webpage, because I can't find good documentation for the modules it uses.

The code scrapes movie data from this webpage.

I get stuck after the comment "selection in pattern follows the rules of CSS".

I would like to understand the logic behind that code, or find good documentation for those modules. Is there a prerequisite topic I need to learn first?

The code is the following:

import requests
from pattern import web
from BeautifulSoup import BeautifulSoup

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url  # notice it constructs the full url for you

# selection in pattern follows the rules of CSS

dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating

haferje · Accepted Answer · 2014-01-12 04:17:00Z

Here's the documentation for BeautifulSoup, which is an HTML and XML parser.

The comment

selection in pattern follows the rules of CSS

means that strings such as 'td.title' and 'span.runtime' are CSS selectors that locate the data you are looking for. For example, td.title matches every <td> element whose class attribute includes title.
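
To make the td.title rule concrete, here is a minimal sketch (Python 3, standard library only) of how a selector of the form tag.class can be tested against a single element. The helper name matches_simple_selector is made up for illustration and is not part of pattern or BeautifulSoup; real selector engines handle far more than this one form.

```python
def matches_simple_selector(selector, tag, class_attr):
    """Return True if an element with the given tag name and class
    attribute string matches a selector of the form 'tag' or 'tag.class'.

    Hypothetical helper for illustration only; not part of any library.
    """
    wanted_tag, _, wanted_class = selector.partition('.')
    # HTML tag names are case-insensitive, so <TD> and <td> are the same.
    if wanted_tag and wanted_tag.lower() != tag.lower():
        return False
    if wanted_class:
        # class="title featured" carries two classes; any one may match.
        return wanted_class in class_attr.split()
    return True


print(matches_simple_selector('td.title', 'td', 'title'))           # True
print(matches_simple_selector('td.title', 'TD', 'title featured'))  # True
print(matches_simple_selector('td.title', 'td', 'subtitle'))        # False
print(matches_simple_selector('td.title', 'span', 'title'))         # False
```

Note that the class is compared as a whole word, which is why class="subtitle" does not match .title.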

The code iterates over each td.title element in the page and extracts the title, genres, runtime, and rating using these CSS selectors.
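
If it helps to see that loop without any third-party module, the following is a rough sketch in Python 3 using only the standard library's html.parser. It is not how pattern works internally. The sample markup and the class name TitleCellParser are invented for illustration, and the sketch assumes td.title cells are never nested inside one another.

```python
from html.parser import HTMLParser

# Made-up sample markup shaped like the rows the question's code expects.
SAMPLE_HTML = """
<table>
  <tr><td class="title"><a href="/m/1">First Movie</a>
      <span class="runtime">142 mins.</span></td></tr>
  <tr><td class="title"><a href="/m/2">Second Movie</a>
      <span class="runtime">175 mins.</span></td></tr>
</table>
"""


class TitleCellParser(HTMLParser):
    """Collect the first link text and the runtime from each td.title."""

    def __init__(self):
        super().__init__()
        self.movies = []      # one dict per td.title cell
        self.current = None   # dict being filled while inside a cell
        self.field = None     # which key the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'td' and 'title' in classes:
            # Equivalent of entering one match of 'td.title'.
            self.current = {}
        elif self.current is not None:
            if tag == 'a' and 'title' not in self.current:
                # First <a> in the cell, like by_tag('a')[0].
                self.field = 'title'
            elif tag == 'span' and 'runtime' in classes:
                # Like by_tag('span.runtime')[0].
                self.field = 'runtime'

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] = data.strip()
            self.field = None

    def handle_endtag(self, tag):
        if tag == 'td' and self.current is not None:
            self.movies.append(self.current)
            self.current = None


parser = TitleCellParser()
parser.feed(SAMPLE_HTML)
for movie in parser.movies:
    print(movie['title'], movie['runtime'])
```

A selector-based library hides all of this bookkeeping behind a single call, which is why the original code is so much shorter.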
