Force save an xml file to xls format in Python

Question

I have the following code, which downloads fund data in Excel 2004 XML format:

import urllib2
url = 'https://www.ishares.com/us/258100/fund-download.dl'
s = urllib2.urlopen(url)
contents = s.read()
file = open("export.xml", 'w')
file.write(contents)
file.close()

My goal is to programmatically convert this file to .xls so that I can then read it into a pandas DataFrame. I am aware I can parse the file using Python's XML libraries; however, I noticed that if I open the XML file and manually save it with the .xls file extension, pandas can read it and I get my desired result.
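For reference, the XML-parsing route mentioned above can be sketched with the standard library alone. This is a minimal sketch, assuming the download is well-formed Excel 2003/2004 "XML Spreadsheet" (SpreadsheetML) markup; the function name `spreadsheetml_rows` and the `SAMPLE` document are hypothetical and are not taken from the actual iShares file, which may need cleanup before it parses.

```python
import xml.etree.ElementTree as ET

# Namespace used by Excel 2003/2004 "XML Spreadsheet" (SpreadsheetML) files.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def spreadsheetml_rows(xml_text, sheet_index=0):
    """Return one worksheet as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    sheets = root.findall("ss:Worksheet", NS)
    rows = []
    for row in sheets[sheet_index].findall("ss:Table/ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) marks a jump over empty cells; pad the gap.
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                cells.extend([""] * (int(index) - 1 - len(cells)))
            data = cell.find("ss:Data", NS)
            cells.append("" if data is None or data.text is None else data.text)
        rows.append(cells)
    return rows


# Hypothetical sample document, only for illustration.
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Holdings">
    <ss:Table>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Name</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="String">Price</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Bond A</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">101.5</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell ss:Index="2"><ss:Data ss:Type="Number">99.0</ss:Data></ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>
"""

rows = spreadsheetml_rows(SAMPLE)
print(rows)
```

The resulting list of rows could then be handed to pandas, for example `pd.DataFrame(rows[1:], columns=rows[0])`, after skipping any preamble rows the real file contains. All values come back as strings, so numeric columns would still need converting.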

I have also tried the following code, which renames the file extension. However, this method does not "force" save the file: it remains an XML document underneath, just with an .xls file extension.

import os
import sys
folder = '~/models'
for filename in os.listdir(folder):
    if filename.startswith('export'):
        infilename = filename
        newname = infilename.replace('newfile.xls', 'f.xls')
        output = os.rename(infilename, newname)
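Renaming only changes the file name, not the bytes inside the file, which is why the renamed file is still XML. A minimal sketch of how one might confirm this by inspecting the first bytes of a file follows; the function name `sniff_spreadsheet_format` is hypothetical, and the check only covers the three formats relevant here (the binary .xls container, the ZIP-based .xlsx container, and XML).

```python
import os
import tempfile


def sniff_spreadsheet_format(path):
    """Guess a spreadsheet file's real format from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"   # OLE2 compound file, the binary .xls container
    if head.startswith(b"PK"):
        return "xlsx"  # ZIP container used by .xlsx
    # Skip an optional UTF-8 byte order mark before looking for markup.
    if head.lstrip(b"\xef\xbb\xbf").startswith(b"<"):
        return "xml"
    return "unknown"


# Demo: an XML file given an .xls name is still XML underneath.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "export.xls")
with open(demo_path, "wb") as f:
    f.write(b'<?xml version="1.0"?><Workbook/>')

detected = sniff_spreadsheet_format(demo_path)
print(detected)  # prints: xml
```

A reader that dispatches on content rather than on the extension will therefore treat the renamed file exactly as it treated the original.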


"... if I open the XML file and manually save it" - with what application? Excel? If it is Excel and you do not care about performance, you can script the same conversion you now perform manually, using OLE from Python.
– sophros, Commented Jul 21, 2017 at 13:48
@sophros Yes, manually saving it with Excel. Thanks, I'll look into oletools.
– Anthony, Commented Jul 21, 2017 at 15:55

Parfait · Accepted Answer · 2017-07-21 21:05:23Z

1

With Excel for Windows, consider using Python to connect to the Excel object library over COM with the win32com module. Specifically, save the downloaded XML as CSV using Excel's Workbooks.OpenXML and SaveAs methods:

import os
import win32com.client as win32    
import requests as r
import pandas as pd

cd = os.path.dirname(os.path.abspath(__file__))

url = "http://www.ishares.com/us/258100/fund-download.dl"
xmlfile = os.path.join(cd, 'iSharesDownload.xml')
csvfile = os.path.join(cd, 'iSharesDownload.csv')

# DOWNLOAD FILE
try:
    rqpage = r.get(url)
    with open(xmlfile, 'wb') as f:
        f.write(rqpage.content)    
except Exception as e:
    print(e)    
finally:
    rqpage = None

# EXCEL COM TO SAVE EXCEL XML AS CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs(csvfile, 6)  # 6 = xlCSV file format
    wb.Close(True)    
except Exception as e:
    print(e)    
finally:
    # RELEASES RESOURCES
    wb = None
    excel = None

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000


2 Comments

Anthony Over a year ago

I should have specified earlier: I am running this script on macOS, so I can't use the win32com client.

Parfait Over a year ago

I actually knew that, since you cited Excel 2004 and there is no such Windows version. For future readers this can still be helpful. Consider building a macro version of the same and having Python call it from the command line.

Parfait · Accepted Answer · 2017-07-24 20:08:39Z

With Excel for Mac, consider a VBA solution, as VBA is the most common language for interfacing with the Excel object library. The macro below downloads the iShares XML, then saves it as CSV for pandas import using the OpenXML and SaveAs methods.

Note: this is untested on Mac; it depends on the Microsoft.XMLHTTP and ADODB.Stream objects being available there.

VBA (save in a macro-enabled workbook)

Option Explicit

Sub DownloadXML()
On Error GoTo ErrHandle
    Dim wb As Workbook
    Dim xmlDoc As Object
    Dim xmlfile As String, csvfile As String

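    ' Application.PathSeparator avoids hard-coding the Windows "\" separator
    xmlfile = ActiveWorkbook.Path & Application.PathSeparator & "file.xml"
    csvfile = ActiveWorkbook.Path & Application.PathSeparator & "file.csv"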

    Call DownloadFile("https://www.ishares.com/us/258100/fund-download.dl", xmlfile)

    Set wb = Excel.Workbooks.OpenXML(xmlfile)

    wb.SaveAs csvfile, 6 ' 6 = xlCSV file format
    wb.Close True

ExitHandle:
    Set wb = Nothing
    Set xmlDoc = Nothing
    Exit Sub

ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Resume ExitHandle
End Sub

Function DownloadFile(url As String, filePath As String)
    Dim WinHttpReq As Object, oStream As Object

    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", url, False
    WinHttpReq.send

    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile filePath, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If

    Set WinHttpReq = Nothing
    Set oStream = Nothing
End Function

Python

import pandas as pd

csvfile = "/path/to/file.csv"

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000

Anthony · Accepted Answer · 2017-08-25 16:26:43Z

0

I was able to avoid web scraping after finding that the site I was working with had developed an API, which I then queried using Python's requests module.

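import requests

url = "https://www.blackrock.com/tools/hackathon/performance"
# tickers is assumed to be a list of ticker strings defined earlier
for ticker in tickers:
    params = {'identifiers': ticker,
              'returnsType': 'MONTHLY'}
    response = requests.get(url, params=params)
    data = response.json()  # parsed JSON for this ticker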
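The loop above overwrites the parsed JSON on every iteration. A minimal sketch that keeps one result per ticker is shown below; the function name `fetch_performance` and its `get` parameter are hypothetical (in real use `get` would be `requests.get`), and nothing is assumed about the shape of the JSON the API returns.

```python
def fetch_performance(tickers, get,
                      url="https://www.blackrock.com/tools/hackathon/performance"):
    """Request monthly returns for each ticker and return {ticker: parsed JSON}.

    `get` is any callable with the same interface as requests.get, which
    keeps the function usable without network access (for example in tests).
    """
    results = {}
    for ticker in tickers:
        params = {"identifiers": ticker, "returnsType": "MONTHLY"}
        response = get(url, params=params)
        # Fail on HTTP errors instead of trying to parse an error page.
        response.raise_for_status()
        results[ticker] = response.json()
    return results
```

With the requests library installed, this could be called as `fetch_performance(tickers, requests.get)`, where `tickers` is the same list used in the answer.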
