
I'm using mechanize to get some data from a password-protected site I have a subscription to.

I can reach the site's .txt using the code:

import mechanize
from bs4 import BeautifulSoup

username = ''
password = ''

login_post_url = "http://www.naturalgasintel.com/user/login"
internal_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt"

browser = mechanize.Browser()
browser.open(login_post_url)
browser.select_form(nr = 1)
browser.form['user[email]'] = username
browser.form['user[password]'] = password
browser.submit()

response = browser.open(internal_url)
print response.read().decode('utf-8').encode('utf-8')

This prints what I'd like the format to look like (minus the extra white space between data points):

Point Code      Issue Date      Trade Date      Region  Pricing Point   Low     High    Average Volume  Deals   Delivery Start Date     Delivery End Date
STXAGUAD        2018-12-21      2018-12-20      South Texas     Agua Dulce                                              2018-12-21      2018-12-21
STXFGTZ1        2018-12-21      2018-12-20      South Texas     Florida Gas Zone 1      3.580   3.690   3.660   30      7       2018-12-21      2018-12-21
STXNGPL 2018-12-21      2018-12-20      South Texas     NGPL S. TX                                              2018-12-21      2018-12-21
STXTENN 2018-12-21      2018-12-20      South Texas     Tennessee Zone 0 South  3.460   3.580   3.525   230     42      2018-12-21      2018-12-21
STXTETCO        2018-12-21      2018-12-20      South Texas     Texas Eastern S. TX     3.510   3.575   3.530   120     28      2018-12-21      2018-12-21
STXST30 2018-12-21      2018-12-20      South Texas     Transco Zone 1  3.505   3.505   3.505   9       2       2018-12-21      2018-12-21
STX3PAL 2018-12-21      2018-12-20      South Texas     Tres Palacios   3.535   3.720   3.630   196     24      2018-12-21      2018-12-21
STXRAVG 2018-12-21      2018-12-20      South Texas     S. TX Regional Avg.     3.460   3.720   3.570   584     103     2018-12-21      2018-12-21

But I'd like to read and write all of this data into an Excel file.

I've tried using soup = BeautifulSoup(response.read().decode('utf-8').encode('utf-8')) to break this into actual text, which gives me the same stuff except in HTML form:

<html><body><p>Point Code\tIssue Date\tTrade Date\tRegion\tPricing Point\tLow\tHigh\tAverage\tVolume\tDeals\tDelivery Start Date\tDelivery End Date\nSTXAGUAD\t2018-12-21\t2018-12-20\tSouth Texas\tAgua Dulce\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXFGTZ1\t2018-12-21\t2018-12-20\tSouth Texas\tFlorida Gas Zone 1\t3.580\t3.690\t3.660\t30\t7\t2018-12-21\t2018-12-21\nSTXNGPL\t2018-12-21\t2018-12-20\tSouth Texas\tNGPL S. TX\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXTENN\t2018-12-21\t2018-12-20\tSouth Texas\tTennessee Zone 0 South\t3.460\t3.580\t3.525\t230\t42\t2018-12-21\t2018-12-21\nSTXTETCO\t2018-12-21\t2018-12-20\tSouth Texas\tTexas Eastern S. TX\t3.510\t3.575\t3.530\t120\t28\t2018-12-21\t2018-12-21\

I could begin looking at stripping off the html tags from this soup variable but is there a way to more easily strip this data?
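For what it's worth, get_text() does strip the tags cleanly. Here's a minimal sketch against a made-up snippet shaped like the output above (the markup is hypothetical, not the site's actual response):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the soup output above
html = "<html><body><p>Point Code\tIssue Date\nSTXAGUAD\t2018-12-21</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()  # Tags are gone; the tab-separated payload remains
print(text)
```

But that still leaves one big tab-separated string rather than rows and columns.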

  • Do you need to use Python 2.7? It's honestly a big hassle working with CSVs and trying to work with UTF-8. Speaking from experience, it will just save a bunch of time and headache now if you can make the switch. Commented Jan 8, 2019 at 1:17
  • It doesn't anymore. I was using 2.7 just because an old script used the twill package - but it was causing more harm than good. I'd be fine using 3.x Commented Jan 8, 2019 at 1:20
  • Great, 2 secs, I'll whip up an answer for you :) And greetings from Edmonton! Commented Jan 8, 2019 at 1:28
  • Okay, I updated an answer for you. This should get you on the road to where you want. Commented Jan 8, 2019 at 1:51
  • Albertans unite! Commented Jan 8, 2019 at 13:44

1 Answer


Since you have indicated that you are okay using python3, I would suggest the following steps:

Download Anaconda

Download Anaconda Python for your OS

Anaconda arguably has the best native support for data science and data retrieval. You'll be getting Python 3.7, which gives you all of the functionality of Python 2.7 (with a couple of changes) without the headache. What's important for your case is that Python 2.7 is a pain in the butt when working with UTF-8. This will fix a lot of those issues:

Install your libraries

After installing Anaconda, (and after you have set conda.exe to your system PATH variable which takes 2 minutes if you opted out during installation), you'll need to install your packages. Judging by your script, that will look something like this:

conda install beautifulsoup4 requests lxml pandas -y
pip install mechanize  # mechanize isn't in the default conda channels

Be patient - this can take 2-10 mins for conda to "resolve your environment" before installing something.

Parsing your Data with Pandas

There are two options for you to try here, and they depend on how lucky you are with the formatting of the HTML you are scraping:

import pandas as pd # This can go at the top with the other imports.

Using pandas.read_html()

response = browser.open(internal_url)
html = response.read().decode('utf-8')
tables = pd.read_html(html) # Returns a list of DataFrames, one per <table> found
df = tables[0]
print(df) # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv", index=False)
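Note that read_html() only picks up actual <table> markup, so whether this works depends on how the page is structured. A minimal sketch of the kind of input it expects (the HTML here is made up for illustration):

```python
import io
import pandas as pd

# Made-up HTML table for illustration -- read_html() needs real <table> tags
html = "<table><tr><th>Point Code</th><th>Low</th></tr><tr><td>STXFGTZ1</td><td>3.580</td></tr></table>"
tables = pd.read_html(io.StringIO(html))  # Returns a list of DataFrames
print(tables[0])
```

If the page serves plain tab-separated text instead (as your output suggests), read_html() will raise a ValueError because there is no table to find.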

Using BeautifulSoup with pandas.read_csv()

import io # This can go at the top with the other imports.

response = browser.open(internal_url)
soup = BeautifulSoup(response.read().decode('utf-8'), 'lxml')
# If your data is embedded within a nested table, you may need to run soup.find() here
text = soup.get_text() # Strips the tags, leaving the tab-separated payload
df = pd.read_csv(io.StringIO(text), sep='\t')
print(df) # This should give you a preview of *fingers-crossed* each piece of data in its own cell.
df.to_csv("naturalgasintel.csv", index=False)
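And since you mentioned Excel specifically: once you have a DataFrame, to_excel() will write a real .xlsx for you. A minimal sketch with made-up data (it needs an engine such as openpyxl installed):

```python
import pandas as pd

# Made-up rows standing in for the parsed feed
df = pd.DataFrame({
    "Point Code": ["STXFGTZ1", "STXTENN"],
    "Low": [3.580, 3.460],
    "High": [3.690, 3.580],
})
df.to_excel("naturalgasintel.xlsx", index=False)  # Requires openpyxl (or another engine)
```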

Hope that helps! Pandas is a fantastic library for intuitively parsing your data.


1 Comment

I know I might be able to use something like requests, but I had to bypass it because it did not work for me. Unfortunately, mechanize was the only package that gave me what I wanted, and it only works in Python 2.7 environments. Also, when using pd.read_html() the argument must be in a readable table form, but my junk is just a big string! Dang. So when I use the read
