I am trying to create a new data frame from the following data, which has 4 columns: start_year, end_year, ego_id, and alter_id. I need to transform it into a new data frame with one observation per year (a year column), using start_year and end_year. For example, if a row in the existing data frame has start_year 2012 and end_year 2016, it should produce 5 rows in the new data frame, for the years 2012, 2013, 2014, 2015, and 2016.
import pandas as pd

d = {'start_year': [2012, 2016, 2006], 'end_year': [2016, 2017, 2016], 'ego_id': ['1011', '1011', '2211'], 'alter_id': ['3311', '9192', '1022']}
df = pd.DataFrame(data=d)
df
   start_year  end_year ego_id alter_id
0        2012      2016   1011     3311
1        2016      2017   1011     9192
2        2006      2016   2211     1022
One simple way to do this is to iterate over each row of the original data frame, create new rows based on start_year and end_year, and append these rows to a new data frame.
However, I found this method inefficient, since I am dealing with a large dataset. Is there a faster way to do it?
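df_empty = pd.DataFrame()
df_empty['year'] = ""
for i in range(df.shape[0]):
    row = df.iloc[i, ]
    # one new row per year in the inclusive range start_year..end_year
    for yr in range(row.start_year, row.end_year + 1):
        matched_row = pd.Series([], dtype=object)
        matched_row['year'] = yr
        # carry over ego_id and alter_id from the original row
        matched_row = pd.concat([matched_row, row[2:]], axis=0)
        # DataFrame.append copies the whole frame each time (and was removed in pandas 2.0)
        df_empty = df_empty.append(matched_row, ignore_index=True)
df_empty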
    year alter_id ego_id
0   2012     3311   1011
1   2013     3311   1011
2   2014     3311   1011
3   2015     3311   1011
4   2016     3311   1011
5   2016     9192   1011
6   2017     9192   1011
7   2006     1022   2211
8   2007     1022   2211
9   2008     1022   2211
10  2009     1022   2211
11  2010     1022   2211
12  2011     1022   2211
13  2012     1022   2211
14  2013     1022   2211
15  2014     1022   2211
16  2015     1022   2211
17  2016     1022   2211
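For comparison, here is a rough sketch of what a loop-free version might look like. It is not something I have benchmarked on the real data, and it assumes end_year >= start_year in every row and a unique index on df. The names n_years, offsets, expanded and result are only illustrative. The idea is to repeat each row once per year it covers, then add a 0, 1, 2, ... offset to start_year.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start_year': [2012, 2016, 2006],
    'end_year': [2016, 2017, 2016],
    'ego_id': ['1011', '1011', '2211'],
    'alter_id': ['3311', '9192', '1022'],
})

# Number of yearly rows each original row expands into (inclusive range).
# Assumption: end_year >= start_year everywhere, so every count is >= 1.
n_years = (df['end_year'] - df['start_year'] + 1).to_numpy()

# Repeat each original row n_years times (relies on df having a unique index).
expanded = df.loc[df.index.repeat(n_years)].copy()

# Offset of each repeated row within its own group: 0, 1, 2, ...
group_starts = np.repeat(n_years.cumsum() - n_years, n_years)
offsets = np.arange(n_years.sum()) - group_starts

expanded['year'] = expanded['start_year'].to_numpy() + offsets

# Same column layout as the desired output above.
result = expanded[['year', 'alter_id', 'ego_id']].reset_index(drop=True)
print(result)
```

Another option along the same lines would be to build a column of range objects and call DataFrame.explode on it, though I would expect the repeat-based version to do less Python-level work per row.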