3

I'm trying to manipulate my data frame similar to how you would using SQL window functions. Consider the following sample set:

import pandas as pd

df = pd.DataFrame({'fruit' : ['apple', 'apple', 'apple', 'orange', 'orange', 'orange', 'grape', 'grape', 'grape'],
               'test' : [1, 2, 1, 1, 2, 1, 1, 2, 1],
               'analysis' : ['full', 'full', 'partial', 'full', 'full', 'partial', 'full', 'full', 'partial'],
               'first_pass' : [12.1, 7.1, 14.3, 19.1, 17.1, 23.4, 23.1, 17.2, 19.1],
               'second_pass' : [20.1, 12.0, 13.1, 20.1, 18.5, 22.7, 14.1, 17.1, 19.4],
               'units' : ['g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g'],
               'order' : [2, 1, 3, 2, 1, 3, 3, 2, 1]})
+--------+------+----------+------------+-------------+-------+-------+
| fruit  | test | analysis | first_pass | second_pass | order | units |
+--------+------+----------+------------+-------------+-------+-------+
| apple  |    1 | full     | 12.1       | 20.1        |     2 | g     |
| apple  |    2 | full     | 7.1        | 12.0        |     1 | g     |
| apple  |    1 | partial  | 14.3       | 13.1        |     3 | g     |
| orange |    1 | full     | 19.1       | 20.1        |     2 | g     |
| orange |    2 | full     | 17.1       | 18.5        |     1 | g     |
| orange |    1 | partial  | 23.4       | 22.7        |     3 | g     |
| grape  |    1 | full     | 23.1       | 14.1        |     3 | g     |
| grape  |    2 | full     | 17.2       | 17.1        |     2 | g     |
| grape  |    1 | partial  | 19.1       | 19.4        |     1 | g     |
+--------+------+----------+------------+-------------+-------+-------+

I'd like to add a few columns:

  • a boolean column to indicate whether the second_pass value for that test and analysis is the highest amongst all fruit types.
  • another column that lists which fruits had the highest second_pass values for each test and analysis combination.

Using this logic, I'd like to get the following table:

+--------+------+----------+------------+-------------+-------+-------+---------+---------------------+
| fruit  | test | analysis | first_pass | second_pass | order | units | highest |   highest_fruits    |
+--------+------+----------+------------+-------------+-------+-------+---------+---------------------+
| apple  |    1 | full     | 12.1       | 20.1        |     2 | g     | true    | ["apple", "orange"] |
| apple  |    2 | full     | 7.1        | 12.0        |     1 | g     | false   | ["orange"]          |
| apple  |    1 | partial  | 14.3       | 13.1        |     3 | g     | false   | ["orange"]          |
| orange |    1 | full     | 19.1       | 20.1        |     2 | g     | true    | ["apple", "orange"] |
| orange |    2 | full     | 17.1       | 18.5        |     1 | g     | true    | ["orange"]          |
| orange |    1 | partial  | 23.4       | 22.7        |     3 | g     | true    | ["orange"]          |
| grape  |    1 | full     | 23.1       | 14.1        |     3 | g     | false   | ["apple", "orange"] |
| grape  |    2 | full     | 17.2       | 17.1        |     2 | g     | false   | ["orange"]          |
| grape  |    1 | partial  | 19.1       | 19.4        |     1 | g     | false   | ["orange"]          |
+--------+------+----------+------------+-------------+-------+-------+---------+---------------------+

I'm new to pandas, so I'm sure I'm missing something very simple.

1
  • Wish I could help further, but I'm swamped at the moment. Off the top of my head, g = df.groupby(['test','analysis'])['second_pass'].agg('idxmax') will give you the indices of the rows with the maximum value of second_pass, grouped by test and analysis. I don't know right now whether it can detect ties, though. Commented Jan 12, 2016 at 5:26
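
On the tie question raised in this comment: as far as I know, idxmax reports only the first row label that holds the maximum, so tied rows are not all returned. The following is a minimal sketch on a made-up three-row frame (the data and the names `keys`, `first_only` and `is_max` are illustrative, not from the question) comparing it with an equality check against the group maximum.

```python
import pandas as pd

# Small illustrative frame with a deliberate tie: apple and orange both score 20.1.
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'grape'],
    'test': [1, 1, 1],
    'analysis': ['full', 'full', 'full'],
    'second_pass': [20.1, 20.1, 14.1],
})

keys = ['test', 'analysis']

# idxmax keeps a single row label per group: the first one holding the maximum.
first_only = df.groupby(keys)['second_pass'].idxmax()

# Comparing each value with its group maximum flags every tied row.
is_max = df['second_pass'] == df.groupby(keys)['second_pass'].transform('max')

print(first_only.tolist())  # [0]
print(is_max.tolist())      # [True, True, False]
```

So idxmax alone is enough when ties cannot occur, while the comparison against transform('max') keeps all tied rows.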

2 Answers

1

You could return boolean values where second_pass equals the group max, since idxmax only returns the first occurrence of the max:

import numpy as np

df['highest'] = df.groupby(['test', 'analysis'])['second_pass'].transform(lambda x: x == np.amax(x)).astype(bool)

and then use np.where to capture all fruit values that have a group max, and merge the result into your DataFrame like so:

highest_fruits = df.groupby(['test', 'analysis']).apply(lambda x: [f for f in np.where(x.second_pass == np.amax(x.second_pass), x.fruit.tolist(), '').tolist() if f!='']).reset_index()
df = df.merge(highest_fruits, on=['test', 'analysis'], how='left').rename(columns={0: 'highest_fruit'})

finally, for your follow-up (calling .tolist() so that each dict value is a plain list, as in the output below):

first_pass = df.groupby(['test', 'analysis']).apply(lambda x: {fruit: x.loc[x.fruit==fruit, 'first_pass'].tolist() for fruit in x.highest_fruit.iloc[0]}).reset_index()
df = df.merge(first_pass, on=['test', 'analysis'], how='left').rename(columns={0: 'first_pass_highest_fruit'})

to get:

  analysis  first_pass   fruit  order  second_pass  test units highest  \
0     full        12.1   apple      2         20.1     1     g    True   
1     full         7.1   apple      1         12.0     2     g   False   
2  partial        14.3   apple      3         13.1     1     g   False   
3     full        19.1  orange      2         20.1     1     g    True   
4     full        17.1  orange      1         18.5     2     g    True   
5  partial        23.4  orange      3         22.7     1     g    True   
6     full        23.1   grape      3         14.1     1     g   False   
7     full        17.2   grape      2         17.1     2     g   False   
8  partial        19.1   grape      1         19.4     1     g   False   

     highest_fruit             first_pass_highest_fruit  
0  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
1         [orange]                   {'orange': [17.1]}  
2         [orange]                   {'orange': [23.4]}  
3  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
4         [orange]                   {'orange': [17.1]}  
5         [orange]                   {'orange': [23.4]}  
6  [apple, orange]  {'orange': [19.1], 'apple': [12.1]}  
7         [orange]                   {'orange': [17.1]}  
8         [orange]                   {'orange': [23.4]} 
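
For comparison, the same columns can probably be built without apply or np.where. This is only a sketch under a few assumptions: it rebuilds the question's frame without the units and order columns, and the names `keys`, `winners`, `per_group` and `first_pass_highest` are made up for illustration. It flags group maxima with transform('max'), collects the winning rows into per-group lists, and joins those lists back onto every row.

```python
import pandas as pd

# The question's data, trimmed to the columns this sketch needs.
df = pd.DataFrame({
    'fruit': ['apple'] * 3 + ['orange'] * 3 + ['grape'] * 3,
    'test': [1, 2, 1] * 3,
    'analysis': ['full', 'full', 'partial'] * 3,
    'first_pass': [12.1, 7.1, 14.3, 19.1, 17.1, 23.4, 23.1, 17.2, 19.1],
    'second_pass': [20.1, 12.0, 13.1, 20.1, 18.5, 22.7, 14.1, 17.1, 19.4],
})

keys = ['test', 'analysis']

# Flag every row whose second_pass equals its group's maximum (ties included).
df['highest'] = df['second_pass'] == df.groupby(keys)['second_pass'].transform('max')

# Group only the winning rows, then collect their fruits and first_pass values.
winners = df[df['highest']].groupby(keys)
per_group = pd.DataFrame({
    'highest_fruit': winners['fruit'].agg(list),
    'first_pass_highest': winners['first_pass'].agg(list),
})

# Broadcast the per-group lists back onto every row of the matching group.
df = df.join(per_group, on=keys)

print(df[['fruit', 'test', 'analysis', 'highest', 'highest_fruit', 'first_pass_highest']])
```

With this data, row 6 (grape, test 1, full) should end up with highest_fruit [apple, orange] and first_pass_highest [12.1, 19.1], which is the result asked for in the follow-up comment. Lists are used here instead of the fruit-keyed dicts above; that is a design choice of the sketch, not a requirement.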

2 Comments

As a follow-up question, is there a way to pull first_pass values into a new column based on highest_fruits? For example, grape's full analysis for test 1 (idx: 6) yields [apple, orange] as the highest fruits. If I wanted to pull apple's and orange's first_pass values into a new column using that test/analysis combination (resulting in [12.1, 19.1]), is there a pandas way?
Such a powerful module. Thanks Stefan!
0

I'm going to assume you meant

'test' : [1, 2, 3, 1, 2, 3, 1, 2, 3]

To generate your first column, you can group by the test number, and compare each second pass score to the max score:

df['highest'] = df['second_pass'] == df.groupby('test')['second_pass'].transform('max')

For the second part, I don't have a clean solution, but here's a somewhat ugly one. First, set the index to fruit:

df = df.set_index('fruit')

Next, find which rows have 'highest' set to True for each test, and return a list of those rows' index values (which are the names of the fruits):

# Parentheses are required: & binds more tightly than ==.
test1_max_fruits = df[(df['test'] == 1) & df['highest']].index.values.tolist()
test2_max_fruits = df[(df['test'] == 2) & df['highest']].index.values.tolist()
test3_max_fruits = df[(df['test'] == 3) & df['highest']].index.values.tolist()

Define a function to look at the test number then return the corresponding max_fruits that we just generated:

def max_fruits(test_num):
    # Return the list of top-scoring fruits for the given test number.
    if test_num == 1:
        return test1_max_fruits

    if test_num == 2:
        return test2_max_fruits

    if test_num == 3:
        return test3_max_fruits

Create a column and apply this function over your 'test' column:

df['highest_fruits'] = df['test'].apply(max_fruits)
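
The three hard-coded variables and the lookup function could likely be replaced by one grouped lookup. The following is a rough sketch, assuming the renumbered test column from the top of this answer and leaving fruit as an ordinary column; the name `max_fruits_by_test` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple'] * 3 + ['orange'] * 3 + ['grape'] * 3,
    'test': [1, 2, 3] * 3,  # the renumbering assumed in this answer
    'second_pass': [20.1, 12.0, 13.1, 20.1, 18.5, 22.7, 14.1, 17.1, 19.4],
})

# Same first step as above: flag rows that hold their test's maximum.
df['highest'] = df['second_pass'] == df.groupby('test')['second_pass'].transform('max')

# One list of winning fruits per test number, indexed by test.
max_fruits_by_test = df[df['highest']].groupby('test')['fruit'].agg(list)

# Look up each row's test number in that Series.
df['highest_fruits'] = df['test'].map(max_fruits_by_test)

print(df[['fruit', 'test', 'highest', 'highest_fruits']])
```

This avoids writing one variable and one branch per test number, so it should keep working if more tests are added.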
