Unable to scrape certain values of a website using regex

Question

I've been trying to scrape the information inside a particular set of p tags on a website and am running into a lot of trouble.

My code looks like:

import urllib
import re

def scrape():
    url = "https://www.theWebsite.com"

    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    status = re.findall(statusText, htmltext)

    print("Status: " + str(status))

scrape()

Which unfortunately returns only: "Status: []"

I have no idea what I am doing wrong, because when I was testing on the same website I could use this pattern

statusText = re.compile('<a href="/about">(.+?)</a>')

instead and I would get what I was trying to get: "Status: ['About', 'About']"

Does anyone know what I can do to get the information within the div tags? Or more specifically the single set of p tags the div tags contain? I've tried plugging in just about any values I could think of and have gotten nowhere. After Google, YouTube, and SO searching I'm running out of ideas now.

Did you check that htmltext is not empty in the first place?
– zx81, Commented May 15, 2014 at 7:21
@zx81 I don't see how it could be any different than when the a tags were there instead of the div tags. Wouldn't htmltext hold data in both cases or in neither?
– Vale, Commented May 15, 2014 at 7:29
Is it absolutely necessary to use regex? Try checking out the BeautifulSoup or Scrapy libraries in Python.
– Mevin Babu, Commented May 15, 2014 at 7:35

Vipul · Accepted Answer · 2014-05-15 07:50:01Z

I use BeautifulSoup for extracting information between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>. You can extract it with BeautifulSoup as follows:

soup = BeautifulSoup(htmltext, 'html.parser')  # htmltext is the page source; this creates the BeautifulSoup object
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
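Applied to the div from the question, the same lookup can be sketched as below. This is only an illustration: `sample_html` is a made-up fragment standing in for the real page, `extract_status` is a hypothetical helper name, and the assumption is that the div is identified by `id="holdsThePtagsIwant"` and contains a single p tag, as the question describes.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page (assumption);
# only the div id is taken from the question.
sample_html = """
<html><body>
<div id="holdsThePtagsIwant">
    <p>All systems operational</p>
</div>
</body></html>
"""

def extract_status(html):
    """Return the text of the first <p> inside the target div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "holdsThePtagsIwant"})
    if div is None:
        return None  # the div is not in this HTML at all
    p = div.find("p")
    return p.get_text(strip=True) if p is not None else None

status = extract_status(sample_html)
print("Status: " + str(status))
```

If `extract_status` returns None on the real page, that suggests the div is not present in the downloaded HTML at all, which is worth checking before adjusting the selector.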

Also see the official bs4 documentation.

As an example, I have edited your code to extract a div from a Bloomberg article; you can adapt it to your own page.

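import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Parse the page with the parser from the standard library
    soup = BeautifulSoup(htmltext, 'html.parser')
    # Find the div whose class and itemprop attributes both match
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print(ans)  # prints None if no matching div is found

scrape()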

You can get BeautifulSoup from here

P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.
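As a side note on the regex itself: one possible reason the pattern in the question returns an empty list (not confirmed, since the real page is not shown) is that `.` does not match newlines by default. A div whose content spans several lines would then never match `(.+?)`, while a single-line link would. The sketch below uses a made-up `sample_html` to show the difference that `re.DOTALL` makes.

```python
import re

# Made-up fragment (assumption): the div content spans several lines,
# while the link sits on a single line.
sample_html = """<div id="holdsThePtagsIwant">
    <p>All systems operational</p>
</div>
<a href="/about">About</a>"""

div_pattern = '<div id="holdsThePtagsIwant">(.+?)</div>'
link_pattern = '<a href="/about">(.+?)</a>'

# Without re.DOTALL, "." stops at the newline right after the opening tag.
without_dotall = re.findall(div_pattern, sample_html)

# With re.DOTALL, "." also matches newlines, so the whole div body is captured.
with_dotall = [m.strip() for m in re.findall(div_pattern, sample_html, re.DOTALL)]

# The single-line link matches either way.
links = re.findall(link_pattern, sample_html)

print(without_dotall)
print(with_dotall)
print(links)
```

Even with `re.DOTALL`, a non-greedy match ends at the first `</div>`, so nested divs would cut the result short. That is one reason an HTML parser such as BeautifulSoup tends to be more reliable here.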
