I'm using mechanize to get some data from a password-protected site I have a subscription to.
I can reach the site's .txt file using this code:
import mechanize
from bs4 import BeautifulSoup
username = ''
password = ''
login_post_url = "http://www.naturalgasintel.com/user/login"
internal_url = "https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/12/20181221td.txt"
browser = mechanize.Browser()
browser.open(login_post_url)
browser.select_form(nr = 1)
browser.form['user[email]'] = username
browser.form['user[password]'] = password
browser.submit()
response = browser.open(internal_url)
print response.read().decode('utf-8').encode('utf-8')
This prints what I'd like the format to look like (minus the extra white space between data points):
Point Code Issue Date Trade Date Region Pricing Point Low High Average Volume Deals Delivery Start Date Delivery End Date
STXAGUAD 2018-12-21 2018-12-20 South Texas Agua Dulce 2018-12-21 2018-12-21
STXFGTZ1 2018-12-21 2018-12-20 South Texas Florida Gas Zone 1 3.580 3.690 3.660 30 7 2018-12-21 2018-12-21
STXNGPL 2018-12-21 2018-12-20 South Texas NGPL S. TX 2018-12-21 2018-12-21
STXTENN 2018-12-21 2018-12-20 South Texas Tennessee Zone 0 South 3.460 3.580 3.525 230 42 2018-12-21 2018-12-21
STXTETCO 2018-12-21 2018-12-20 South Texas Texas Eastern S. TX 3.510 3.575 3.530 120 28 2018-12-21 2018-12-21
STXST30 2018-12-21 2018-12-20 South Texas Transco Zone 1 3.505 3.505 3.505 9 2 2018-12-21 2018-12-21
STX3PAL 2018-12-21 2018-12-20 South Texas Tres Palacios 3.535 3.720 3.630 196 24 2018-12-21 2018-12-21
STXRAVG 2018-12-21 2018-12-20 South Texas S. TX Regional Avg. 3.460 3.720 3.570 584 103 2018-12-21 2018-12-21
But I'd like to read and write all of this data into an Excel file.
I've tried using soup = BeautifulSoup(response.read().decode('utf-8').encode('utf-8')) to break this into actual text, but that gives me the same content wrapped in HTML:
<html><body><p>Point Code\tIssue Date\tTrade Date\tRegion\tPricing Point\tLow\tHigh\tAverage\tVolume\tDeals\tDelivery Start Date\tDelivery End Date\nSTXAGUAD\t2018-12-21\t2018-12-20\tSouth Texas\tAgua Dulce\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXFGTZ1\t2018-12-21\t2018-12-20\tSouth Texas\tFlorida Gas Zone 1\t3.580\t3.690\t3.660\t30\t7\t2018-12-21\t2018-12-21\nSTXNGPL\t2018-12-21\t2018-12-20\tSouth Texas\tNGPL S. TX\t\t\t\t\t\t2018-12-21\t2018-12-21\nSTXTENN\t2018-12-21\t2018-12-20\tSouth Texas\tTennessee Zone 0 South\t3.460\t3.580\t3.525\t230\t42\t2018-12-21\t2018-12-21\nSTXTETCO\t2018-12-21\t2018-12-20\tSouth Texas\tTexas Eastern S. TX\t3.510\t3.575\t3.530\t120\t28\t2018-12-21\t2018-12-21\
I could start stripping the HTML tags from this soup variable, but is there an easier way to extract this data?
I also tried the twill package, but it was causing more harm than good. I'd be fine using Python 3.x.
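For reference, here is a minimal sketch of the kind of thing I'm after, written for Python 3. It assumes the response body is plain tab-separated text (the \t and \n in the soup output suggest so), in which case BeautifulSoup may not be needed at all and the html/body/p wrapper appears to be added by the parser. The helper names parse_tsv and write_csv are made up for illustration, and the sample string is a shortened copy of the output above standing in for response.read().decode('utf-8').

```python
import csv
import io
import os
import tempfile


def parse_tsv(text):
    """Split tab-separated text into a list of rows (each a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def write_csv(rows, path):
    """Write the rows to a CSV file, which Excel can open directly."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Shortened sample copied from the output above. In the real script this
# would be the decoded response body rather than a hard-coded string.
sample = (
    "Point Code\tIssue Date\tTrade Date\tRegion\tPricing Point\tLow\tHigh\t"
    "Average\tVolume\tDeals\tDelivery Start Date\tDelivery End Date\n"
    "STXAGUAD\t2018-12-21\t2018-12-20\tSouth Texas\tAgua Dulce\t\t\t\t\t\t"
    "2018-12-21\t2018-12-21\n"
    "STXFGTZ1\t2018-12-21\t2018-12-20\tSouth Texas\tFlorida Gas Zone 1\t"
    "3.580\t3.690\t3.660\t30\t7\t2018-12-21\t2018-12-21\n"
)

rows = parse_tsv(sample)

# The temp directory is used only so the sketch runs anywhere;
# any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "20181221td.csv")
write_csv(rows, out_path)
```

Empty price fields (as in the Agua Dulce row) come through as empty strings, so the columns stay aligned. If a real .xlsx file is required rather than a CSV, the same rows could presumably be handed to a third-party library such as openpyxl, or the text loaded with pandas (read_csv with sep='\t', then DataFrame.to_excel), but I haven't verified either against this feed.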