
I am trying to read in HTML pages and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years for a company. I can fetch the page and use either BeautifulSoup or html2text to turn it into one huge text block. I then want to search that text (I have been using re.search) but can't get the pattern to match. Here is the line I am trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.

from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build the url

text_soup = BeautifulSoup(urlopen(full_url).read())  # read in the page

# flatten the whole document into one text block
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

# this is the search that never matches
eps = re.search(r"EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
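For reference, a regex would need a pattern that accounts for the run-together figures; a minimal sketch, assuming every value has exactly two decimal places as in the sample above:

import re

text = 'EPS (Basic)\n13.4620.6226.6930.1732.81\n\n'
# assumes every figure has exactly two decimal places, as in the sample
EPS = [float(m) for m in re.findall(r'\d+\.\d{2}', text)]
print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]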
  • The HTML after I soup.prettify() is: </a> EPS (Basic) </td> <td class="valueCell"> 13.46 </td> <td class="valueCell"> 20.62 </td> <td class="valueCell"> 26.69 </td> <td class="valueCell"> 30.17 </td> <td class="valueCell"> 32.81 </td>

3 Answers


It's not good practice to parse HTML with regular expressions. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and the text EPS (Basic) in it, then iterate over the following siblings that have the valueCell class:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read())  # read in the page

# row labels live in td cells with class "rowTitle";
# the figures follow in sibling cells with class "valueCell"
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

prints:

['13.46', '20.62', '26.69', '30.17', '32.81']

Hope that helps.


1 Comment

That is a really nice and simple solution. But when I run it I get an extra u character in the output: [u'13.46', u'20.62', u'26.69', u'30.17', u'32.81'] Any thoughts?
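(The u prefix is just how Python 2 displays unicode strings; it disappears once you convert the values, e.g. to the floats the question asks for. A minimal sketch:)

values = [u'13.46', u'20.62', u'26.69', u'30.17', u'32.81']
# float() accepts unicode strings; the u prefix is only repr() noise
EPS = [float(v) for v in values]
print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]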

I would take a very different approach. We use lxml for scraping HTML pages.

One of the reasons we switched was that BeautifulSoup went unmaintained (or at least without updates) for a while.

In my test I ran the following:

import requests
from lxml import html
from collections import OrderedDict

page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content

# parse the raw HTML into an element tree
tree = html.fromstring(page_as_string)

Now I looked at the page and saw that the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically, but I will leave that for you.
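(A rough sketch of that, if you wanted it: pick the table by its content rather than its position. The row label is assumed from the page as quoted in the question.)

# hypothetical: select the table that actually contains the row we want
eps_table = next(t for t in tree.iter('table')
                 if 'EPS (Basic)' in t.text_content())

For the walkthrough below, though, simple indexing is enough: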

tables = [e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]  # EPS is in the second (last) table

Now I noticed that the first row has the column headings, so I want to separate out all of the rows:

table_rows = [e for e in eps_table.iter() if e.tag == 'tr']

Now let's get the column headings:

column_headings = [e.text_content() for e in table_rows[0].iter() if e.tag == 'th']

Finally, we can map the column headings to the row labels and cell values:

my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            # the first cell holds the row label
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell

    my_results.append(temp_dict)

Now, to access the results:

for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]

This prints:

row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend : 

Now, there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal).
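A minimal squareness check might look like this (a sketch, reusing table_rows from above):

# every data row should yield the same number of cells
cell_counts = set(len([e for e in row.iter() if e.tag == 'td'])
                  for row in table_rows[1:])
assert len(cell_counts) == 1, 'table is not square: %s' % cell_counts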

Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
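For instance, an XPath version might look like this (a sketch reusing the tree from above; the cell layout is assumed from the page as it looked then):

# hypothetical XPath: take the text of the cells that follow the row title
values = tree.xpath('//td[contains(., "EPS (Basic)")]'
                    '/following-sibling::td/text()')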

I should add that every row from the table is available, in the original row order: the first item (a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.

When I need a new build of lxml I visit a page maintained by a really nice guy at UC Irvine.

I hope this helps.


from bs4 import BeautifulSoup
import urllib2
import lxml  # not used directly; bs4 picks it up as its parser if installed
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'

soup = BeautifulSoup(urllib2.urlopen(url).read())

# the financials table is marked with data-ajax-content="true"
table = soup.find('table', {'data-ajax-content': 'true'})

data = []

# collect the non-empty cell text from every row
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)

print df

dictframe = df.to_dict()

print dictframe

The above code gives you a DataFrame from the web page and then uses it to create a Python dictionary.
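As an aside, pandas can often do the table extraction in one step with read_html (a sketch; it needs lxml or html5lib installed, and it only sees tables served in the initial HTML, so an ajax-loaded table may not show up):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> it can parse
tables = pd.read_html(url)
print tables[0]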
