Restructuring Dataframe in Python

Question

I have gathered data from the penultimate worksheet in this Excel file, along with all the data in the last worksheet from "Maturity Years" of 5.5 onward, and I have code that does this. I am now trying to restructure the DataFrame so that it has the columns described further down, but I am struggling to do so.

My code is below.

import urllib2
import pandas as pd
import os
import xlrd 

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)

xd = pd.ExcelFile(socket)

#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)

bigdata = df1.append(df2,ignore_index = True)

Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:

                       0    1   2   3         4         5         6   \
0                     NaN  NaN NaN NaN       NaN       NaN       NaN   
1                     NaN  NaN NaN NaN       NaN       NaN       NaN   
2                Maturity  NaN NaN NaN       NaN       NaN       NaN   
3                  years:  NaN NaN NaN       NaN       NaN       NaN   
4                     NaN  NaN NaN NaN       NaN       NaN       NaN   
5     2005-01-03 00:00:00  NaN NaN NaN       NaN       NaN       NaN   
6     2005-01-04 00:00:00  NaN NaN NaN       NaN       NaN       NaN
...                   ...  ...  ..  ..       ...       ...       ...   
5410  2015-04-20 00:00:00  NaN NaN NaN       NaN  0.367987  0.357069   
5411  2015-04-21 00:00:00  NaN NaN NaN       NaN  0.362478  0.352581

It has 5440 rows and 61 columns

However, I want the DataFrame to have the following columns:

Date (the 2nd column in the current DataFrame)
Update time (a column filled with datetime.datetime.now())
Currency (a column filled with 'GBP')
Maturity Date
Yield data from the spreadsheet

I think columns 1, 2, 3, 4, 5 and 6 contain yield curve data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.

Tried running your code, but I don't have xlrd installed. If you'd create a small DataFrame illustrating the problem without urllibbing things off the internet, I think other people would be more forthcoming in their replies.
– Ami Tavory, Commented Jun 17, 2015 at 17:41
@AmiTavory I've given more information about the current DataFrame and the one that I want. Please let me know if further information is required.
– Jojo, Commented Jun 17, 2015 at 18:18
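
For readers without the spreadsheet, here is a minimal sketch of the kind of self-contained example the comment asks for. It assumes the raw sheet has a row labelled 'years:' that holds the maturities, followed by one row per date. The frame `raw`, its values, and the names `header_pos`, `maturities` and `tidy` are all made up for illustration; the real sheet may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw sheet: a 'years:' row holding maturities,
# a blank row, then one row per date. All values here are made up.
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    ["years:", 0.5, 1.0, 1.5],
    [np.nan, np.nan, np.nan, np.nan],
    [pd.Timestamp("2005-01-03"), 4.60, 4.55, np.nan],
    [pd.Timestamp("2005-01-04"), 4.61, 4.56, 4.50],
])

# Locate the row labelled 'years:' instead of hard-coding its position.
header_pos = raw.index[raw[0].astype(str) == "years:"][0]
maturities = raw.loc[header_pos, 1:].astype(float)

# Keep only the rows whose first cell is a timestamp and use it as the index.
is_date = raw[0].map(lambda v: isinstance(v, pd.Timestamp))
tidy = raw.loc[is_date].set_index(0).astype(float)
tidy.index = pd.DatetimeIndex(tidy.index, name="Date")
tidy.columns = pd.Index(maturities.values, name="years")

print(tidy)
```

With dates as the index and maturities as the column labels, it is easier to see where the "Maturity Years" values sit relative to the yields.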

marc_s · Accepted Answer · 2015-08-26 20:52:15Z

I use the pandas.io.excel.read_excel function to read the xls file from the URL. Here is one way to clean this UK yield curve dataset.

Note: running the cubic spline interpolation via the apply function takes quite a while (about 2 minutes on my PC). It interpolates from about 100 points to 300 points, row by row (2628 rows in total).

from pandas.io.excel import read_excel
import pandas as pd
import numpy as np

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'

# check the sheet numbers: spot curve is sheet 9 of 9, short end is sheet 7 of 9
# (sheetname is zero-based; newer pandas versions call this parameter sheet_name)
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)

# preprocessing spot_curve
# ==============================================
# do a few inspections of the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning; keep NaN for now, as forward-filling NaN is not recommended for a yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities up to 5 years, as those are duplicated in the short-end sheet
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]

# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape  # (2715, 40)

# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape  # (2715, 60)

# merge the two; their time indexes are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)

combined_data.head()
combined_data.tail()
combined_data.shape

# deal with NaN: the soundest approach is to fit an arbitrage-free NSS curve;
# however, this is not currently supported in Python,
# so do a cubic spline instead
# ==============================================

# if more than half of the maturity points are NaN, interpolation is likely to be
# unstable, so drop all rows with more than 50 NaNs
# (a boolean row mask is used here because groupby(...).filter expects a scalar bool per group)
nan_count = combined_data.isnull().sum(axis=1)
combined_data = combined_data[nan_count <= 50]
# no. of rows down from 2715 to 2628
combined_data.shape  # (2628, 100)


from scipy.interpolate import interp1d

# mapping points, monthly frequency, 1 month to 25 years
# (dividing by 12.0 forces float division under Python 2 as well)
maturity = pd.Series((np.arange(12 * 25) + 1) / 12.0)
# do the interpolation day by day
by_day = combined_data.groupby(level=0)

# function to apply to each day's row
def interpolate_maturities(group):
    # transpose the row vector to a column vector and drop all NaNs
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)

# this may take a while: apply provides flexibility but its speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)

# a quick look at the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
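
The per-row interpolation step can be checked on a single hand-made row before running it over the whole table. The sketch below uses made-up yields and a shorter 10-year grid; `row`, `known`, `grid` and `curve` are illustrative names, not part of the code above.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Made-up yields at a handful of maturities (years); NaN marks a missing quote.
row = pd.Series([4.0, 4.2, np.nan, 4.5, 4.6, 4.65],
                index=[0.5, 1.0, 2.0, 3.0, 5.0, 10.0])

# Drop the missing point and fit a cubic interpolant through the rest.
known = row.dropna()
f = interp1d(known.index.values, known.values, kind="cubic",
             bounds_error=False, assume_sorted=True)

# Evaluate on a monthly grid in one vectorised call. Grid points outside the
# observed maturity range come back as NaN because bounds_error=False.
grid = np.arange(1, 12 * 10 + 1) / 12.0
curve = pd.Series(f(grid), index=grid)

print(curve.head(8))
```

Calling `f` once on the whole grid avoids one Python-level call per grid point, which is what `maturity.apply(f)` does, so it may be worth trying if the run time is a concern.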


Can I please ask how I might convert the years, which currently increase from left to right, so that they increase from top to bottom (i.e. is it possible to make them a column)? Thank you.
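
One possible way to do this is to reshape from wide to long with `melt`, which also leaves room for the constant columns the question asks for. This is only a sketch: `wide` is a hypothetical stand-in for `cleaned_spot_curve` with made-up values, and the maturity is kept in years rather than converted to a calendar "Maturity Date".

```python
import datetime
import pandas as pd

# Hypothetical stand-in for cleaned_spot_curve: dates down the rows,
# maturities (in years) across the columns. Values are made up.
wide = pd.DataFrame(
    [[4.60, 4.55], [4.61, 4.56]],
    index=pd.DatetimeIndex(["2005-01-03", "2005-01-04"], name="Date"),
    columns=[0.5, 1.0],
)

# melt() turns each maturity column into rows, giving one row per
# (Date, Maturity) pair with the yield in a single value column.
long_df = (
    wide.reset_index()
        .melt(id_vars="Date", var_name="Maturity", value_name="Yield")
        .sort_values(["Date", "Maturity"])
        .reset_index(drop=True)
)

# Constant columns requested in the question.
long_df["Update time"] = datetime.datetime.now()
long_df["Currency"] = "GBP"
long_df = long_df[["Date", "Update time", "Currency", "Maturity", "Yield"]]

print(long_df)
```

Unlike `stack()`, `melt` keeps rows whose yield is NaN, so add a `dropna(subset=["Yield"])` if those should be excluded.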
