
I am trying to read in HTML pages and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years for a company. I can fetch the page and use either BeautifulSoup or html2text to turn it into one huge text block. I then want to search that text (I have been using re.search) but can't get the pattern to match. Here is the line I am trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.

from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  # build the url

text_soup = BeautifulSoup(urlopen(full_url).read())  # read in the page

# flatten the whole document into one text block
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

# this is the search that never matches
eps = re.search(r"EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
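For reference, a regex would need a pattern that accounts for the run-together figures; a minimal sketch, assuming every value has exactly two decimal places as in the sample above:

import re

text = 'EPS (Basic)\n13.4620.6226.6930.1732.81\n\n'
# assumes every figure has exactly two decimal places, as in the sample
EPS = [float(m) for m in re.findall(r'\d+\.\d{2}', text)]
print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]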
  • The HTML after I soup.prettify() is: </a> EPS (Basic) </td> <td class="valueCell"> 13.46 </td> <td class="valueCell"> 20.62 </td> <td class="valueCell"> 26.69 </td> <td class="valueCell"> 30.17 </td> <td class="valueCell"> 32.81 </td>

3 Answers


It's not good practice to parse HTML with regular expressions. Use the BeautifulSoup parser instead: find the cell with the rowTitle class and the text EPS (Basic) in it, then iterate over the following siblings that have the valueCell class:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read())  # read in the page

# row labels live in td cells with class "rowTitle";
# the figures follow in sibling cells with class "valueCell"
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

prints:

['13.46', '20.62', '26.69', '30.17', '32.81']

Hope that helps.


1 Comment

That is a really nice and simple solution. But when I run it I get an extra u character in the output: [u'13.46', u'20.62', u'26.69', u'30.17', u'32.81'] Any thoughts?
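(The u prefix is just how Python 2 displays unicode strings; it disappears once you convert the values, e.g. to the floats the question asks for. A minimal sketch:)

values = [u'13.46', u'20.62', u'26.69', u'30.17', u'32.81']
# float() accepts unicode strings; the u prefix is only repr() noise
EPS = [float(v) for v in values]
print EPS  # [13.46, 20.62, 26.69, 30.17, 32.81]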

I would take a very different approach. We use lxml for scraping HTML pages.

One of the reasons we switched was that BeautifulSoup went unmaintained (or at least without updates) for a while.

In my test I ran the following:

import requests
from lxml import html
from collections import OrderedDict

page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content

# parse the raw HTML into an element tree
tree = html.fromstring(page_as_string)

Now I looked at the page and saw that the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically, but I will leave that for you.
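(A rough sketch of that, if you wanted it: pick the table by its content rather than its position. The row label is assumed from the page as quoted in the question.)

# hypothetical: select the table that actually contains the row we want
eps_table = next(t for t in tree.iter('table')
                 if 'EPS (Basic)' in t.text_content())

For the walkthrough below, though, simple indexing is enough: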

tables = [e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]  # EPS is in the second (last) table

Now I noticed that the first row has the column headings, so I want to separate out all of the rows:

table_rows = [e for e in eps_table.iter() if e.tag == 'tr']

Now let's get the column headings:

column_headings = [e.text_content() for e in table_rows[0].iter() if e.tag == 'th']

Finally, we can map the column headings to the row labels and cell values:

my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            # the first cell holds the row label
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell

    my_results.append(temp_dict)

Now, to access the results:

for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]

This prints:

row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend : 

Now, there is still more to do; for example, I did not test for squareness (that the number of cells in each row is equal).
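A minimal squareness check might look like this (a sketch, reusing table_rows from above):

# every data row should yield the same number of cells
cell_counts = set(len([e for e in row.iter() if e.tag == 'td'])
                  for row in table_rows[1:])
assert len(cell_counts) == 1, 'table is not square: %s' % cell_counts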

Finally, I am a novice and I suspect others will advise more direct methods of getting at these elements (XPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.
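For instance, an XPath version might look like this (a sketch reusing the tree from above; the cell layout is assumed from the page as it looked then):

# hypothetical XPath: take the text of the cells that follow the row title
values = tree.xpath('//td[contains(., "EPS (Basic)")]'
                    '/following-sibling::td/text()')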

I should add that every row from the table is available, in the original row order: the first item (a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row, and so on.

When I need a new build of lxml I visit a page maintained by a really nice guy at UC Irvine.

I hope this helps.


from bs4 import BeautifulSoup
import urllib2
import lxml  # not used directly; bs4 picks it up as its parser if installed
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'

soup = BeautifulSoup(urllib2.urlopen(url).read())

# the financials table is marked with data-ajax-content="true"
table = soup.find('table', {'data-ajax-content': 'true'})

data = []

# collect the non-empty cell text from every row
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)

print df

dictframe = df.to_dict()

print dictframe

The above code gives you a DataFrame from the web page and then uses it to create a Python dictionary.
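As an aside, pandas can often do the table extraction in one step with read_html (a sketch; it needs lxml or html5lib installed, and it only sees tables served in the initial HTML, so an ajax-loaded table may not show up):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> it can parse
tables = pd.read_html(url)
print tables[0]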
