Adding rows to groups in Pandas DataFrame

Question

I have the following Pandas DataFrame:

     start_timestamp_milli  end_timestamp_milli       name  rating
1            1555414708025        1555414723279    Valence       2   
2            1555414708025        1555414723279    Arousal       6   
3            1555414708025        1555414723279  Dominance       2   
4            1555414708025        1555414723279    Sadness       1   
5            1555414813304        1555414831795    Valence       3   
6            1555414813304        1555414831795    Arousal       5   
7            1555414813304        1555414831795  Dominance       2   
8            1555414813304        1555414831795    Sadness       1   
9            1555414921819        1555414931382    Valence       1   
10           1555414921819        1555414931382    Arousal       7   
11           1555414921819        1555414931382  Dominance       2   
12           1555414921819        1555414931382    Sadness       1   
13           1555414921819        1555414931382      Anger       1

In the above example there are three groups, defined by start_timestamp_milli and end_timestamp_milli. The first group runs from index 1 to 4, the second from index 5 to 8, and the third from index 9 to 13.
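For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame and confirms the three groups. The values are copied from the table above; the variable names `cols` and `group_sizes` are just illustrative choices.

```python
import pandas as pd

# Rebuild the sample frame from the question (index starts at 1, as shown above).
df = pd.DataFrame(
    {
        "start_timestamp_milli": [1555414708025] * 4 + [1555414813304] * 4 + [1555414921819] * 5,
        "end_timestamp_milli": [1555414723279] * 4 + [1555414831795] * 4 + [1555414931382] * 5,
        "name": ["Valence", "Arousal", "Dominance", "Sadness"] * 3 + ["Anger"],
        "rating": [2, 6, 2, 1, 3, 5, 2, 1, 1, 7, 2, 1, 1],
    },
    index=range(1, 14),
)

cols = ["start_timestamp_milli", "end_timestamp_milli"]

# One entry per (start, end) pair, with the number of rows in each group.
group_sizes = df.groupby(cols).size()
print(group_sizes)
```

The groups should have 4, 4 and 5 rows, matching the index ranges described above.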

For each such group, if "Anger" or "Happiness" is not present in the name column, I would like to insert it with a rating of 0. If it is already present, nothing should happen.

The final result should look like this. The added rows are 5, 6, 11, 12 and 17.

     start_timestamp_milli  end_timestamp_milli       name  rating
1            1555414708025        1555414723279    Valence       2
2            1555414708025        1555414723279    Arousal       6
3            1555414708025        1555414723279  Dominance       2
4            1555414708025        1555414723279    Sadness       1
5            1555414708025        1555414723279  Happiness       0
6            1555414708025        1555414723279      Anger       0
7            1555414813304        1555414831795    Valence       3
8            1555414813304        1555414831795    Arousal       5
9            1555414813304        1555414831795  Dominance       2
10           1555414813304        1555414831795    Sadness       1
11           1555414813304        1555414831795  Happiness       0
12           1555414813304        1555414831795      Anger       0
13           1555414921819        1555414931382    Valence       1
14           1555414921819        1555414931382    Arousal       7
15           1555414921819        1555414931382  Dominance       2
16           1555414921819        1555414931382    Sadness       1
17           1555414921819        1555414931382  Happiness       0
18           1555414921819        1555414931382      Anger       1

How can this be done?

piRSquared · Accepted Answer · 2019-05-09 15:24:32Z

Option 1

This loops explicitly over each group, appends a dummy DataFrame containing the two required names, and drops duplicate names so that ratings already present are kept.

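import pandas as pd

d = dict(name=['Anger', 'Happiness'], rating=0)
cols = ['start_timestamp_milli', 'end_timestamp_milli']

def f(d0, k):
    # Dummy rows for this group: the group's key plus 'Anger' and 'Happiness' rated 0
    d1 = pd.DataFrame({**dict(zip(cols, k)), **d})
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead;
    # drop_duplicates keeps the first (original) row for each name
    return pd.concat([d0, d1], ignore_index=True).drop_duplicates('name')

pd.concat([f(g, k) for k, g in df.groupby(cols)], ignore_index=True)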

    start_timestamp_milli  end_timestamp_milli       name  rating
0           1555414708025        1555414723279    Valence       2
1           1555414708025        1555414723279    Arousal       6
2           1555414708025        1555414723279  Dominance       2
3           1555414708025        1555414723279    Sadness       1
4           1555414708025        1555414723279      Anger       0
5           1555414708025        1555414723279  Happiness       0
6           1555414813304        1555414831795    Valence       3
7           1555414813304        1555414831795    Arousal       5
8           1555414813304        1555414831795  Dominance       2
9           1555414813304        1555414831795    Sadness       1
10          1555414813304        1555414831795      Anger       0
11          1555414813304        1555414831795  Happiness       0
12          1555414921819        1555414931382    Valence       1
13          1555414921819        1555414931382    Arousal       7
14          1555414921819        1555414931382  Dominance       2
15          1555414921819        1555414931382    Sadness       1
16          1555414921819        1555414931382      Anger       1
17          1555414921819        1555414931382  Happiness       0

Option 2

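This builds the full target index and uses reindex, filling the missing entries with 0. Note that the result comes back sorted by the index rather than in the original row order.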

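import pandas as pd

cats = ['Anger', 'Happiness']
cols = ['start_timestamp_milli', 'end_timestamp_milli']

d = df.set_index([*cols, 'name'])
# Every (start, end) pair crossed with the required names, merged with the existing index.
# Index.union is used because `|` on an Index is no longer a set union in pandas 2.0+.
i = pd.MultiIndex.from_tuples(
    [(s, e, n) for s, e in {*zip(*map(df.get, cols))} for n in cats],
    names=d.index.names
).union(d.index)

d.reindex(i, fill_value=0).reset_index()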

    start_timestamp_milli  end_timestamp_milli       name  rating
0           1555414708025        1555414723279      Anger       0
1           1555414708025        1555414723279    Arousal       6
2           1555414708025        1555414723279  Dominance       2
3           1555414708025        1555414723279  Happiness       0
4           1555414708025        1555414723279    Sadness       1
5           1555414708025        1555414723279    Valence       2
6           1555414813304        1555414831795      Anger       0
7           1555414813304        1555414831795    Arousal       5
8           1555414813304        1555414831795  Dominance       2
9           1555414813304        1555414831795  Happiness       0
10          1555414813304        1555414831795    Sadness       1
11          1555414813304        1555414831795    Valence       3
12          1555414921819        1555414931382      Anger       1
13          1555414921819        1555414931382    Arousal       7
14          1555414921819        1555414931382  Dominance       2
15          1555414921819        1555414931382  Happiness       0
16          1555414921819        1555414931382    Sadness       1
17          1555414921819        1555414931382    Valence       1

The difference between my answer and @WeNYoBen's is that theirs adds a zero for every unique value in name for each grouping of start and end. This may be what the OP wanted but did not explicitly say. I intentionally went to the extra trouble of adding 'Anger' and 'Happiness' only where they didn't already exist in a group. OP and any future readers, please choose according to your needs.

Your approach worked well. I would like to add another row to each grouping of start and end, namely a row with name "Neutral". The rating of the "Neutral" row should be 1 if Anger and Sadness both have a rating of 0, and 0 otherwise. Is this possible?
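
For the follow-up about a "Neutral" row, one possible sketch is below. It is not from either answer. It runs on a small made-up frame (`filled`) standing in for the output of the options above, the helper `neutral_row` is a hypothetical name, and it assumes that a name missing from a group counts as a rating of 0.

```python
import pandas as pd

cols = ["start_timestamp_milli", "end_timestamp_milli"]

# Small stand-in for an already-filled result (hypothetical values).
filled = pd.DataFrame(
    {
        "start_timestamp_milli": [1, 1, 1, 2, 2, 2],
        "end_timestamp_milli": [5, 5, 5, 9, 9, 9],
        "name": ["Anger", "Sadness", "Valence"] * 2,
        "rating": [0, 0, 2, 1, 0, 3],
    }
)

def neutral_row(key, group):
    # Ratings by name; a name absent from the group is treated as 0 (assumption).
    ratings = group.set_index("name")["rating"]
    is_neutral = ratings.get("Anger", 0) == 0 and ratings.get("Sadness", 0) == 0
    return {**dict(zip(cols, key)), "name": "Neutral", "rating": int(is_neutral)}

# One "Neutral" row per (start, end) group.
neutral = pd.DataFrame([neutral_row(k, g) for k, g in filled.groupby(cols)])

# Append the new rows, then sort by the group keys so each group's rows sit together.
with_neutral = (
    pd.concat([filled, neutral], ignore_index=True)
    .sort_values(cols)
    .reset_index(drop=True)
)
print(with_neutral)
```

In this toy frame the first group gets Neutral = 1 (Anger and Sadness are both 0) and the second gets Neutral = 0 (Anger is 1).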

BENY · Accepted Answer · 2019-05-09 15:20:28Z


I am using unstack + stack + reindex

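import pandas as pd

# Names already present plus the two required ones; sorting gives a deterministic column order
names = sorted(set(df['name']) | {'Anger', 'Happiness'})

(df.set_index(df.columns[:-1].tolist())['rating']
   .unstack(fill_value=0)
   .reindex(columns=names, fill_value=0)
   .stack()
   .reset_index(name='rating'))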
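
To see what the unstack + reindex + stack chain does, here is a small sketch on a made-up frame (`small`, with a hypothetical `group` column in place of the two timestamp columns). It also shows that this approach fills in every name seen anywhere in the data, not only the required ones.

```python
import pandas as pd

# Tiny hypothetical frame: group 1 has A and B, group 2 has only A.
small = pd.DataFrame(
    {"group": [1, 1, 2], "name": ["A", "B", "A"], "rating": [5, 6, 7]}
)

# Names in the data plus one required name that is missing everywhere.
wanted = sorted(set(small["name"]) | {"C"})

wide = (
    small.set_index(["group", "name"])["rating"]
    .unstack(fill_value=0)                   # one column per name, gaps become 0
    .reindex(columns=wanted, fill_value=0)   # add a column for any missing required name
)

# Back to long format: one row per (group, name) combination.
long = wide.stack().reset_index(name="rating")
print(long)
```

Group 2 ends up with a row for B rated 0 even though only C was asked for, which is the behaviour to keep in mind when choosing between the answers.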

edited May 9, 2019 at 15:20

answered May 9, 2019 at 15:04


3 Comments

Vink Over a year ago

This almost works. Add a .drop_duplicates().reset_index() to get the desired output.

BENY Over a year ago

@Vink Thank you. I think using a set is slightly better than drop_duplicates.

Vink Over a year ago

agreed, safer & more precise
