I've been trying to scrape the text inside a particular set of p tags on a website and am running into a lot of trouble.

My code looks like:

import urllib
import re

def scrape():
    url = "https://www.theWebsite.com"

    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()

    status = re.findall(statusText, htmltext)

    print("Status: " + str(status))

scrape()

Unfortunately, this prints only: "Status: []"

I have no idea what I am doing wrong, because when I was testing on the same website I could use the pattern

statusText = re.compile('<a href="/about">(.+?)</a>')

instead and I would get what I expected: "Status: ['About', 'About']"

Does anyone know what I can do to get the information within the div tags? Or, more specifically, the single set of p tags that the div contains? I've tried plugging in just about any value I could think of and have gotten nowhere. After searching Google, YouTube, and SO, I'm running out of ideas.

  • Did you check that htmltext is not empty in the first place? Commented May 15, 2014 at 7:21
  • @zx81 I don't see how it could be any different than when the a tags were there instead of the div tags. Wouldn't htmltext hold data in both cases or in neither? Commented May 15, 2014 at 7:29
  • Is it absolutely necessary to use regex? Try checking out the BeautifulSoup or Scrapy libraries for Python. Commented May 15, 2014 at 7:35

1 Answer

I use BeautifulSoup to extract information from between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>. You can create a BeautifulSoup object and find that div with:

from bs4 import BeautifulSoup

soup = BeautifulSoup(htmltext, 'html.parser')  # htmltext is the page's HTML as a string
ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})

Also see the official bs4 documentation.

As an example, I have edited your code to extract a div from a Bloomberg article. You can adapt it to your own page:

import urllib
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)  # Python 2 API, as in the question
    htmltext = htmlfile.read()
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmltext, 'html.parser')
    # Find the first div whose class and itemprop attributes both match
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print(ans)

scrape()
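
For the div in the question, the same approach could look roughly like the sketch below. It looks the div up by its id and then collects the text of each p tag inside it. The inline HTML string and the extract_paragraphs helper are made up for illustration; the real page's markup may differ, so treat this as a starting point rather than a drop-in solution.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = """
<div id="holdsThePtagsIwant">
    <p>Status: online</p>
</div>
"""

def extract_paragraphs(htmltext, div_id):
    """Return the text of every <p> inside the div with the given id."""
    soup = BeautifulSoup(htmltext, 'html.parser')
    div = soup.find('div', id=div_id)
    if div is None:
        # The div is not in this HTML (wrong id, or content added later by JavaScript)
        return []
    return [p.get_text(strip=True) for p in div.find_all('p')]

paragraphs = extract_paragraphs(html, 'holdsThePtagsIwant')
print(paragraphs)
```

Returning an empty list when the div is missing keeps the caller from hitting an AttributeError on None, which is what soup.find returns when nothing matches.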

You can install BeautifulSoup with: pip install beautifulsoup4

P.S.: I use Scrapy and BeautifulSoup for web scraping and am satisfied with both.
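
As for why the regex in the question may return an empty list while the a-tag pattern works: by default, "." in a Python regex does not match a newline. If the div's contents span several lines (a guess, since the real page is not shown), "(.+?)" can never reach the closing tag, whereas a one-line a tag matches fine. A small sketch with made-up HTML shows the difference that re.DOTALL makes:

```python
import re

# Made-up HTML in which the div's contents span several lines
html = '<div id="holdsThePtagsIwant">\n<p>Status: online</p>\n</div>'
pattern = '<div id="holdsThePtagsIwant">(.+?)</div>'

# Default behaviour: "." stops at the newline, so nothing matches
without_flag = re.findall(pattern, html)

# With re.DOTALL, "." also matches newlines, so the div body is captured
with_flag = re.findall(pattern, html, re.DOTALL)

print(without_flag)
print(with_flag)
```

Even with re.DOTALL, a regex like this stops at the first closing div tag, so it breaks on nested divs. That is one more reason to prefer an HTML parser here.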
