
I have gathered data from the penultimate worksheet in this Excel file, along with all the data in the last worksheet from "Maturity Years" of 5.5 onward, and I have code that does this. However, I now want to restructure the DataFrame so that it has the columns listed at the end of this question, and I am struggling to do so.

My code is below.

import urllib2
import pandas as pd
import os
import xlrd 

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)

xd = pd.ExcelFile(socket)

#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)

bigdata = df1.append(df2,ignore_index = True)

Edit: The DataFrame currently looks as follows. Unfortunately, it is pretty disorganized:

                       0    1   2   3         4         5         6   \
0                     NaN  NaN NaN NaN       NaN       NaN       NaN   
1                     NaN  NaN NaN NaN       NaN       NaN       NaN   
2                Maturity  NaN NaN NaN       NaN       NaN       NaN   
3                  years:  NaN NaN NaN       NaN       NaN       NaN   
4                     NaN  NaN NaN NaN       NaN       NaN       NaN   
5     2005-01-03 00:00:00  NaN NaN NaN       NaN       NaN       NaN   
6     2005-01-04 00:00:00  NaN NaN NaN       NaN       NaN       NaN
...                   ...  ...  ..  ..       ...       ...       ...   
5410  2015-04-20 00:00:00  NaN NaN NaN       NaN  0.367987  0.357069   
5411  2015-04-21 00:00:00  NaN NaN NaN       NaN  0.362478  0.352581

It has 5440 rows and 61 columns.

However, I want the DataFrame to have the following columns:

  • Date (which is the 2nd column in the current DataFrame)
  • Update time (which would just be a column with datetime.datetime.now())
  • Currency (which would just be a column with 'GBP')
  • Maturity Date
  • Yield data from the spreadsheet

I think columns 1, 2, 3, 4, 5 and 6 contain yield curve data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
  • Tried running your code, but I don't have xlrd installed. If you'd create a small DataFrame illustrating the problem without urllibbing things off the internet, I think other people would be more forthcoming in their replies. Commented Jun 17, 2015 at 17:41
  • @AmiTavory I've given more information about the current Dataframe and the one that I want. Please let me know if further information is required. Commented Jun 17, 2015 at 18:18
  • Well done! <padding> Commented Jun 17, 2015 at 18:18

1 Answer


I use the pandas.io.excel.read_excel function to read the .xls file directly from the URL. Here is one way to clean this UK yield curve dataset.

Note: running the cubic spline interpolation through the apply function takes quite a while (about 2 minutes on my PC). It interpolates each row from about 100 points to 300 points, one row at a time (2628 rows in total).
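To make the interpolation step concrete, here is a minimal sketch on a single made-up curve. The maturities and yields are invented for illustration and are not taken from the Bank of England file; only interp1d and its arguments match the code below.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Hypothetical one-day curve: maturity in years -> yield in percent.
row = pd.Series(
    [0.40, 0.55, np.nan, 1.10, 1.45, 1.70],
    index=[0.5, 1.0, 2.0, 3.0, 5.0, 7.0],
)

# Drop missing maturities before fitting the interpolant.
known = row.dropna()

# kind='cubic' needs at least four known points.
# bounds_error=False returns NaN outside the known maturity range.
f = interp1d(known.index.values, known.values, kind='cubic',
             bounds_error=False, assume_sorted=True)

# Monthly grid from 1 month to 7 years, evaluated in one vectorized call.
grid = np.arange(1, 7 * 12 + 1) / 12.0
interpolated = pd.Series(f(grid), index=grid)
```

Grid points below the shortest known maturity (0.5 years here) come back as NaN rather than being extrapolated. Calling f on the whole grid at once, instead of once per grid point, may also reduce the run time, although I have not timed that on the full dataset.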

from pandas.io.excel import read_excel
import pandas as pd
import numpy as np

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'

# check the sheet numbers (0-based): spot curve is sheet 9 of 9, short end is sheet 7 of 9
# (newer pandas versions spell this keyword sheet_name)
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)

# preprocessing spot_curve
# ==============================================
# inspect the table: shape, first/last column, first/last row
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning; keep NaN for now, as forward-filling NaN is not recommended for yield curves
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities up to 5 years, as those are duplicated in the short-end sheet
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]

# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
# Out[184]: (2715, 40)

# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
# Out[185]: (2715, 60)

# merge the two; their time indexes are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)

combined_data.head()
combined_data.tail()
combined_data.shape

# deal with NaN: the soundest approach is to fit a no-arbitrage NSS curve
# however, this is not currently supported in Python.
# do a cubic spline instead
# ==============================================

# if more than half of the maturity points are NaN, the interpolation is likely to be unstable,
# so remove all rows with more than 50 NaNs
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50

combined_data = combined_data.groupby(level=0).filter(filter_func)
# number of rows drops from 2715 to 2628
combined_data.shape
# Out[186]: (2628, 100)


from scipy.interpolate import interp1d

# mapping points, monthly frequency, 1 month to 25 years
# (divide by 12.0 so the grid stays fractional under Python 2 integer division)
maturity = pd.Series((np.arange(12 * 25) + 1) / 12.0)
# do the interpolation day by day
by_day = combined_data.groupby(level=0)

# write out apply function
def interpolate_maturities(group):
    # transpose row vector to column vector and drops all nans
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)

# this may take a while ... apply provides flexibility, but its speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)

# a quick look at the data
cleaned_spot_curve.iloc[[1, 1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')

[Figure: Cross-Maturity Yield Curve]

[Figure: Time-Series]
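As a side note, the row filter in the code above (dropping dates with too many missing maturities) can probably also be written as a plain boolean mask, which should select the same rows as long as each date appears only once in the index. The sketch below uses a made-up 3 x 4 frame; toy and max_missing are hypothetical names, and the threshold of 2 stands in for the 50 used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for combined_data: 3 dates x 4 maturities (values are invented).
toy = pd.DataFrame(
    [[0.5, 0.6, 0.7, 0.8],
     [np.nan, np.nan, np.nan, 0.9],
     [0.4, np.nan, 0.6, 0.7]],
    index=pd.to_datetime(['2015-04-17', '2015-04-20', '2015-04-21']),
    columns=[0.5, 1.0, 1.5, 2.0],
)

# Hypothetical threshold: keep rows with at most this many missing values.
max_missing = 2

# Count NaNs per row, then keep only the rows at or below the threshold.
nan_counts = toy.isnull().sum(axis=1)
filtered = toy[nan_counts <= max_missing]
```

Here the middle row has three missing values and is dropped, while the other two rows are kept.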


2 Comments

It's very helpful. Cheers.
May I ask how to convert the years, which currently increase from left to right, so that they increase from top to bottom (i.e. is it possible to make them a column)? Thank you.
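One possible way to do that reshape is DataFrame.melt, sketched here on a small made-up frame rather than the real data. The frame wide stands in for cleaned_spot_curve, and the output column names follow the layout requested in the question. Treating "Maturity" as the maturity in years, not a calendar date, is an assumption.

```python
import datetime
import numpy as np
import pandas as pd

# Toy stand-in for cleaned_spot_curve: dates as index, maturities (years) as columns.
wide = pd.DataFrame(
    [[0.45, 0.60, 0.75],
     [0.44, 0.58, np.nan]],
    index=pd.to_datetime(['2015-04-20', '2015-04-21']),
    columns=[0.5, 1.0, 1.5],
)
wide.index.name = 'Date'

# melt turns each maturity column into rows: one row per (Date, Maturity) pair.
tidy = wide.reset_index().melt(id_vars='Date', var_name='Maturity',
                               value_name='Yield')
tidy['Maturity'] = tidy['Maturity'].astype(float)
tidy = tidy.sort_values(['Date', 'Maturity']).reset_index(drop=True)

# Constant columns requested in the question.
tidy['Update time'] = datetime.datetime.now()
tidy['Currency'] = 'GBP'
tidy = tidy[['Date', 'Update time', 'Currency', 'Maturity', 'Yield']]
```

Rows with a missing yield are kept by melt; tidy.dropna(subset=['Yield']) would remove them if they are not wanted.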
