Pandas: encoding row values/transform dataframe

Question

This is a little similar to this answer but my problem is a little different.

My dataframe:

df=pd.DataFrame([{'date': '2017-01-01', 'id': 'google', 'category': 1, 'Score1': 51, 'Score2': 531},
         {'date': '2017-01-01', 'id': 'google', 'category': 2, 'Score1': 592, 'Score2': 152},
         {'date': '2017-01-01', 'id': 'google', 'category': 5, 'Score1': 55, 'Score2': 255},
         {'date': '2017-01-01', 'id': 'yahoo', 'category': 7, 'Score1': 597, 'Score2': 357},
         {'date': '2017-01-01', 'id': 'yahoo', 'category': 8, 'Score1': 58, 'Score2': 58},
         {'date': '2017-01-02', 'id': 'google', 'category': 5, 'Score1': 795, 'Score2': 455},
         {'date': '2017-01-02', 'id': 'google', 'category': 1, 'Score1': 71, 'Score2': 751},
         {'date': '2017-01-02', 'id': 'google', 'category': 2, 'Score1': 792, 'Score2': 352},
         {'date': '2017-01-02', 'id': 'yahoo', 'category': 7, 'Score1': 77, 'Score2': 957},
         {'date': '2017-01-02', 'id': 'yahoo', 'category': 8, 'Score1': 798, 'Score2': 358}
       ])

And looks like this:

         date      id  category  Score1  Score2
0  2017-01-01  google         1      51     531
1  2017-01-01  google         2     592     152
2  2017-01-01  google         5      55     255
3  2017-01-01   yahoo         7     597     357
4  2017-01-01   yahoo         8      58      58
5  2017-01-02  google         5     795     455
6  2017-01-02  google         1      71     751
7  2017-01-02  google         2     792     352
8  2017-01-02   yahoo         7      77     957
9  2017-01-02   yahoo         8     798     358

I need to transform this into a dataframe that looks like this:

date    id  cat1_score1 cat2_score1 cat5_score1 cat7_score1 cat8_score1 cat1_score2 cat2_score2 cat5_score2 cat7_score2 cat8_score2
1/1/17  google    51        592         55           0            0            531       152        255           0         0
1/1/17  yahoo      0        0            0          597          58             0         0          0           357        58
1/2/17  google    71        792        795           0            0            751       352        455           0         0
1/2/17  yahoo      0         0           0          77           798            0         0          0           957        358

The caveat here is that the number of categories can vary from id to id. There could also potentially be a secondary ID column which would need to be considered. I could enumerate over the values in the id column but then how would I transform the dataframe accordingly?

Scott Boston · Accepted Answer · 2017-10-12 21:17:58Z

3

Let's reshape like this:

df_out = df.set_index(['date', 'id', 'category']).stack().unstack([2, 3]).fillna(0)
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out = df_out.add_prefix('cat').reset_index()
print(df_out)

Output:

         date      id  cat1_Score1  cat1_Score2  cat2_Score1  cat2_Score2  \
0  2017-01-01  google         51.0        531.0        592.0        152.0   
1  2017-01-01   yahoo          0.0          0.0          0.0          0.0   
2  2017-01-02  google         71.0        751.0        792.0        352.0   
3  2017-01-02   yahoo          0.0          0.0          0.0          0.0   

   cat5_Score1  cat5_Score2  cat7_Score1  cat7_Score2  cat8_Score1  \
0         55.0        255.0          0.0          0.0          0.0   
1          0.0          0.0        597.0        357.0         58.0   
2        795.0        455.0          0.0          0.0          0.0   
3          0.0          0.0         77.0        957.0        798.0   

   cat8_Score2  
0          0.0  
1         58.0  
2          0.0  
3        358.0

Explanation: First set_index on everything but scoring columns, then use stack to align all the scores vertically. Next, use unstack to pivot category and score labels to columns and fillna with zeroes. Use columns map to combine column levels into one level and add_prefix 'cat' to columns. Lastly, reset_index get all columns back into the dataframe from index.

edited Oct 12, 2017 at 21:17

answered Oct 12, 2017 at 20:46

Scott Boston

154k15 gold badges160 silver badges207 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Craig Over a year ago

Thanks @Scott - this appears to have removed date and id from the dataframe columns though

Scott Boston Over a year ago

You can reset_index a the end to get those back into the dataframe.

Scott Boston Over a year ago

@Craig Thank you, and you're welcome. Happy coding!

Collectives™ on Stack Overflow

Pandas: encoding row values/transform dataframe

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related