4

I'm trying to free myself of JMP for data analysis but cannot determine the pandas equivalent of JMP's Split Columns function. I'm starting with the following DataFrame:

In [1]: df = pd.DataFrame({'Level0': [0,0,0,0,0,0,1,1,1,1,1,1], 'Level1': [0,1,0,1,0,1,0,1,0,1,0,1], 'Vals': [1,3,2,4,1,6,7,5,3,3,2,8]})
In [2]: df
Out[2]:
    Level0  Level1  Vals
0        0       0     1
1        0       1     3
2        0       0     2
3        0       1     4
4        0       0     1
5        0       1     6
6        1       0     7
7        1       1     5
8        1       0     3
9        1       1     3
10       1       0     2
11       1       1     8

I can handle some of the output scenarios of JMP's function using the pivot_table function, but I'm stumped on the case where the Vals column is split by unique combinations of Level0 and Level1 to give the following output:

Level0   0       1
Level1   0   1   0   1
0        1   3   7   5
1        2   4   3   3
2        1   6   2   8

I tried pd.pivot_table(df, values='Vals', columns=['Level0', 'Level1']) but this gives mean values for the different combinations:

Level0  Level1
0       0         1.333333
        1         4.333333
1       0         4.000000
        1         5.333333

I also tried pd.pivot_table(df, values='Vals', index=df.index, columns=['Level0', 'Level1'] which gets me the column headers I want but doesn't work because it forces the output to have the same number of rows as the original so the output has a lot of NaN values:

Level0   0       1
Level1   0   1   0   1
0        1 NaN NaN NaN
1      NaN   3 NaN NaN
2        2 NaN NaN NaN
3      NaN   4 NaN NaN
4        1 NaN NaN NaN
5      NaN   6 NaN NaN
6      NaN NaN   7 NaN
7      NaN NaN NaN   5
8      NaN NaN   3 NaN
9      NaN NaN NaN   3
10     NaN NaN   2 NaN
11     NaN NaN NaN   8

Any suggestions?

1 Answer 1

3

It's a bit of workaround, but you can do:

df.pivot_table(index=df.groupby(['Level0', 'Level1']).cumcount(), 
               columns=['Level0', 'Level1'], values='Vals', aggfunc='first')
Out: 
Level0  0     1   
Level1  0  1  0  1
0       1  3  7  5
1       2  4  3  3
2       1  6  2  8

The idea here is that the index of the output is not readily available in the original DataFrame. You can get it with the following:

df.groupby(['Level0', 'Level1']).cumcount()
Out: 
0     0
1     0
2     1
3     1
4     2
5     2
6     0
7     0
8     1
9     1
10    2
11    2
dtype: int64

Now if you pass this as the index of the pivot_table, an arbitrary aggfunc (mean, min, max, first or last) should work for you as those index-column pairs have only one entry.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks ayhan. Works great. Is the aggfunc='first' actually needed here? I get the same answer without it.
@endangeredoxen The default is 'mean' for aggfunc. Since there is a single value, it actually doesn't matter. mean, min, max, first or last, any one of them will do.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.