Create multiple rows by expanding data frame using Python

Question

I am trying to create a new data frame from the following data, which has 4 columns: start_year, end_year, ego_id, and alter_id. I need to transform it into a data frame with one observation per year (a year column), using start_year and end_year. For example, if start_year in a row is 2012 and end_year is 2016, that row should produce 5 rows in the new data frame, for the years 2012, 2013, 2014, 2015, and 2016.

import pandas as pd

d = {'start_year': [2012, 2016, 2006], 'end_year': [2016, 2017, 2016], 'ego_id': ['1011', '1011', '2211'], 'alter_id': ['3311', '9192', '1022']}
df = pd.DataFrame(data=d)
df

    start_year  end_year    ego_id  alter_id
0   2012    2016    1011    3311
1   2016    2017    1011    9192
2   2006    2016    2211    1022

One simple way to do this is to iterate over each row of the original data frame, create new rows based on start_year and end_year, and append these rows to a new data frame.

However, I found this method inefficient, since I am dealing with a large dataset. Is there a faster way to do it?

# Collect one dict per (row, year) pair and build the data frame once at the end;
# DataFrame.append is not available in pandas 2.0 and later
rows = []

for i in range(df.shape[0]):
    row = df.iloc[i]

    for yr in range(row.start_year, row.end_year + 1):
        rows.append({'year': yr, 'alter_id': row.alter_id, 'ego_id': row.ego_id})

df_empty = pd.DataFrame(rows)



df_empty

    year    alter_id    ego_id
0   2012    3311    1011
1   2013    3311    1011
2   2014    3311    1011
3   2015    3311    1011
4   2016    3311    1011
5   2016    9192    1011
6   2017    9192    1011
7   2006    1022    2211
8   2007    1022    2211
9   2008    1022    2211
10  2009    1022    2211
11  2010    1022    2211
12  2011    1022    2211
13  2012    1022    2211
14  2013    1022    2211
15  2014    1022    2211
16  2015    1022    2211
17  2016    1022    2211

Henry Yik · Accepted Answer · 2020-12-01 16:11:51Z

You can use a list comprehension to build the list of years for each row and then explode it:

print (df.assign(year=[list(range(lo, hi+1)) for lo, hi in df.filter(like="year").to_numpy()])
         .explode("year")
         .drop(columns=["start_year", "end_year"]))

  ego_id alter_id  year
0   1011     3311  2012
0   1011     3311  2013
0   1011     3311  2014
0   1011     3311  2015
0   1011     3311  2016
1   1011     9192  2016
1   1011     9192  2017
2   2211     1022  2006
2   2211     1022  2007
2   2211     1022  2008
2   2211     1022  2009
2   2211     1022  2010
2   2211     1022  2011
2   2211     1022  2012
2   2211     1022  2013
2   2211     1022  2014
2   2211     1022  2015
2   2211     1022  2016
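Note that the exploded frame keeps the original index (0, 0, 0, ...) and the year column ends up with object dtype. If a clean 0..n-1 index and integer years are wanted, the same idea can be wrapped as in the sketch below. The function name expand_years and its parameters are made up for illustration, and it assumes end_year is never smaller than start_year.

```python
import pandas as pd


def expand_years(df, start="start_year", end="end_year"):
    """Return one row per year between start and end (inclusive).

    Hypothetical helper; assumes end >= start in every row.
    """
    # One list of years per input row
    years = [list(range(lo, hi + 1)) for lo, hi in zip(df[start], df[end])]
    out = (
        df.assign(year=years)
          .explode("year")               # one row per list element
          .drop(columns=[start, end])
          .reset_index(drop=True)        # replace the repeated index
    )
    # explode leaves object dtype; cast back to integers
    out["year"] = out["year"].astype(int)
    return out


df = pd.DataFrame({
    "start_year": [2012, 2016, 2006],
    "end_year": [2016, 2017, 2016],
    "ego_id": ["1011", "1011", "2211"],
    "alter_id": ["3311", "9192", "1022"],
})
result = expand_years(df)
print(result)
```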


1 Comment

user14742805 Over a year ago

This works perfectly. I have another column to add: duration_month, which represents how many months within a year the relationship between ego and alter lasts. Is there a way to explode based on the month columns as well?

data = {
    "start_yr": [2012],
    "end_yr": [2014],
    "start_mon": [1],
    "end_mon": [7],
    "ego": ["1"],
    "alter": ["3"],
}
df = pandas.DataFrame(data=data)

# desired outcome
data = {
    "yr": [2012, 2013, 2014],
    "dur_mon": [12, 12, 7],
    "ego": ["1", "1", "1"],
    "alter": ["3", "3", "3"],
}
df = pandas.DataFrame(data=data)
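One possible way to handle the follow-up in the comment, as a rough sketch rather than a tested answer: explode by year as before, then work out the first and last month covered in each year. This assumes the relationship runs from start_mon of start_yr to end_mon of end_yr, with both months counted, which is an interpretation of the desired outcome and not something the comment states.

```python
import pandas as pd

df = pd.DataFrame({
    "start_yr": [2012],
    "end_yr": [2014],
    "start_mon": [1],
    "end_mon": [7],
    "ego": ["1"],
    "alter": ["3"],
})

# One row per year, keeping the start/end columns for the month arithmetic
years = [list(range(lo, hi + 1)) for lo, hi in zip(df["start_yr"], df["end_yr"])]
out = df.assign(yr=years).explode("yr").reset_index(drop=True)
out["yr"] = out["yr"].astype(int)

# First covered month: start_mon in the first year, otherwise January
first = out["start_mon"].where(out["yr"] == out["start_yr"], 1)
# Last covered month: end_mon in the last year, otherwise December
last = out["end_mon"].where(out["yr"] == out["end_yr"], 12)

# Assumption: both the first and the last month count toward the duration
out["dur_mon"] = last - first + 1
out = out[["yr", "dur_mon", "ego", "alter"]]
print(out)
```

If start_yr and end_yr are the same, both conditions hold for the single row, so the duration becomes end_mon - start_mon + 1.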

Gijs Wobben · Accepted Answer · 2020-12-01 16:18:28Z

This should work:

import pandas

# Your data
data = {
    "start_year": [2012, 2016, 2006],
    "end_year": [2016, 2017, 2016],
    "ego_id": ["1011", "1011", "2211"],
    "alter_id": ["3311", "9192", "1022"],
}
df = pandas.DataFrame(data=data)

# Add a column with all the years in between the start and end
df["range"] = df.apply(lambda row: range(row["start_year"], row["end_year"] + 1), axis=1)

# Create a new series that contains every year on a new line, maintaining the original index
years = df.apply(lambda x: pandas.Series(x["range"]), axis=1).stack().reset_index(level=1, drop=True)
years.name = "year"

# Join back to the original dataframe on the index
df = df.drop(columns=["range"]).join(years.astype(int))
df
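Since the question mentions a large dataset, a variant that avoids the row-wise apply calls may be worth trying. The sketch below is a different technique from the answer above: it repeats each row with NumPy and adds a per-row offset to start_year. It assumes end_year is never smaller than start_year, and no timing is claimed here; it would need to be measured on the real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start_year": [2012, 2016, 2006],
    "end_year": [2016, 2017, 2016],
    "ego_id": ["1011", "1011", "2211"],
    "alter_id": ["3311", "9192", "1022"],
})

# Number of yearly rows each original row expands to (assumes end >= start)
lengths = (df["end_year"] - df["start_year"] + 1).to_numpy()

# Repeat each row position once per year it covers
positions = np.repeat(np.arange(len(df)), lengths)
expanded = df.iloc[positions].copy()

# Offset inside each original row: 0, 1, 2, ... restarting at every new row
starts = lengths.cumsum() - lengths
offsets = np.arange(lengths.sum()) - np.repeat(starts, lengths)

expanded["year"] = expanded["start_year"].to_numpy() + offsets
expanded = expanded.drop(columns=["start_year", "end_year"]).reset_index(drop=True)
print(expanded)
```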
