pandas DataFrame reshape by multiple column values

Question

I'm trying to free myself of JMP for data analysis but cannot determine the pandas equivalent of JMP's Split Columns function. I'm starting with the following DataFrame:

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Level0': [0,0,0,0,0,0,1,1,1,1,1,1], 'Level1': [0,1,0,1,0,1,0,1,0,1,0,1], 'Vals': [1,3,2,4,1,6,7,5,3,3,2,8]})
In [3]: df
Out[3]:
    Level0  Level1  Vals
0        0       0     1
1        0       1     3
2        0       0     2
3        0       1     4
4        0       0     1
5        0       1     6
6        1       0     7
7        1       1     5
8        1       0     3
9        1       1     3
10       1       0     2
11       1       1     8

I can handle some of the output scenarios of JMP's function using the pivot_table function, but I'm stumped on the case where the Vals column is split by unique combinations of Level0 and Level1 to give the following output:

Level0   0       1
Level1   0   1   0   1
0        1   3   7   5
1        2   4   3   3
2        1   6   2   8

I tried pd.pivot_table(df, values='Vals', columns=['Level0', 'Level1']) but this gives mean values for the different combinations:

Level0  Level1
0       0         1.333333
        1         4.333333
1       0         4.000000
        1         5.333333

I also tried pd.pivot_table(df, values='Vals', index=df.index, columns=['Level0', 'Level1']), which gets me the column headers I want. It doesn't work, though, because it forces the output to have the same number of rows as the original, so the output has a lot of NaN values:

Level0   0       1
Level1   0   1   0   1
0        1 NaN NaN NaN
1      NaN   3 NaN NaN
2        2 NaN NaN NaN
3      NaN   4 NaN NaN
4        1 NaN NaN NaN
5      NaN   6 NaN NaN
6      NaN NaN   7 NaN
7      NaN NaN NaN   5
8      NaN NaN   3 NaN
9      NaN NaN NaN   3
10     NaN NaN   2 NaN
11     NaN NaN NaN   8

Any suggestions?

score 3 · Accepted Answer · 2016-12-10 17:12:53Z

It's a bit of a workaround, but you can do:

df.pivot_table(index=df.groupby(['Level0', 'Level1']).cumcount(), 
               columns=['Level0', 'Level1'], values='Vals', aggfunc='first')
Out: 
Level0  0     1   
Level1  0  1  0  1
0       1  3  7  5
1       2  4  3  3
2       1  6  2  8

The idea here is that the index of the output is not readily available in the original DataFrame. You can get it with the following:

df.groupby(['Level0', 'Level1']).cumcount()
Out: 
0     0
1     0
2     1
3     1
4     2
5     2
6     0
7     0
8     1
9     1
10    2
11    2
dtype: int64

Now if you pass this as the index of pivot_table, any of the usual aggfunc choices (mean, min, max, first or last) should work for you, because each index-column pair has only one entry.
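The same reshape can also be sketched without pivot_table, by putting the cumcount result into the index and unstacking the two level columns. This is an alternative to the approach above, not a replacement for it; the helper column name `row` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Level0': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'Level1': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'Vals':   [1, 3, 2, 4, 1, 6, 7, 5, 3, 3, 2, 8],
})

# Position of each row within its (Level0, Level1) group: 0, 1, 2, ...
row = df.groupby(['Level0', 'Level1']).cumcount()

# 'row' is a hypothetical helper column; it becomes the output index,
# and Level0/Level1 are moved from the index into the column headers.
wide = (
    df.assign(row=row)
      .set_index(['row', 'Level0', 'Level1'])['Vals']
      .unstack(['Level0', 'Level1'])
)
print(wide)
```

Because no aggregation is involved, unstack raises an error if a (row, Level0, Level1) combination appears more than once, which can be a useful sanity check.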

edited Dec 10, 2016 at 17:12

answered Dec 10, 2016 at 17:07

user2285236

2 Comments

endangeredoxen Over a year ago

Thanks ayhan. Works great. Is the aggfunc='first' actually needed here? I get the same answer without it.

user2285236 Over a year ago

@endangeredoxen The default is 'mean' for aggfunc. Since there is a single value, it actually doesn't matter. mean, min, max, first or last, any one of them will do.
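A quick way to check this claim is to confirm that every cell receives exactly one value and then compare the two results. In this sketch the cumcount is stored in a hypothetical helper column named `row` so it can be passed to pivot_table by name. Note that the default 'mean' may return floats rather than integers depending on the pandas version, so the comparison is on values, not dtypes.

```python
import pandas as pd

df = pd.DataFrame({
    'Level0': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'Level1': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'Vals':   [1, 3, 2, 4, 1, 6, 7, 5, 3, 3, 2, 8],
})

# 'row' is a made-up helper column holding the within-group position.
df['row'] = df.groupby(['Level0', 'Level1']).cumcount()

# Number of values that land in each output cell; all should be 1.
cell_sizes = df.groupby(['row', 'Level0', 'Level1']).size()

by_first = df.pivot_table(index='row', columns=['Level0', 'Level1'],
                          values='Vals', aggfunc='first')
by_mean = df.pivot_table(index='row', columns=['Level0', 'Level1'],
                         values='Vals')  # aggfunc defaults to 'mean'

# Compare values only; dtypes may differ (int vs float).
same = (by_first.to_numpy() == by_mean.to_numpy()).all()
print(cell_sizes.max(), same)
```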
