How to group by multiple columns and count unique values in Python pandas

Question

I have a DataFrame df_data:

CustID    MatchID    LocationID   isMajor  #Major is 1 and Minor is 0
  1        11111       324         0  
  1        11111       324         0
  1        11111       324         0
  1        22222       490         0
  1        33333       675         1
  2        44444       888         0

I have a function with parameters like this:

def compute_something(list_minor=None, list_major=None):
    pass  # placeholder body

Explanation of the parameters: for CustID = 1 the parameters should be list_minor = [3, 1] (order is not important) and list_major = [1], because LocationID = 324 appears 3 times and LocationID = 490 appears once (324 and 490 both have isMajor = 0, so their counts go into the same list). Similarly, CustID = 2 should get list_minor = [1] and list_major = [] (if a customer has no major or minor data, I should pass []).

This is my program:

import pandas as pd

data = [
    [1, 11111, 324, 0],
    [1, 11111, 324, 0],
    [1, 11111, 324, 0],
    [1, 22222, 490, 0],
    [1, 33333, 675, 1],
    [2, 44444, 888, 0]
]
df_data = pd.DataFrame(data, columns = ['CustID','MatchID','LocationID','IsMajor'])
df_parameter = pd.DataFrame()

# nunique() counts distinct LocationID values per group, not rows
df_parameter['parameters'] = df_data.groupby(['CustID','MatchID','IsMajor'])['LocationID'].nunique()

But the result in df_parameter['parameters'] is wrong:

                                    parameters
 CustID     MatchID    IsMajor
   1         11111        0             1   #should be 3
             22222        0             1
             33333        1             1
   2         44444        0             1

Can I get the parameters I explained above with groupby and pass them to the function?

Quang Hoang · Accepted Answer · 2021-01-27 04:45:24Z

How about:

(df.groupby(['CustID','isMajor', 'MatchID']).size()
   .groupby(level=[0,1]).agg(set)
   .unstack('isMajor')
)

Output:

isMajor       0    1
CustID              
1        {1, 3}  {1}
2           {1}  NaN
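
To get from this table to the function call in the question, one option is to collect lists instead of sets and turn the missing cells into empty lists. The following is a minimal sketch, assuming the column is named isMajor and holds the integers 0 and 1; compute_something is stubbed so the example runs, and the helper to_list and the results dict are illustrative names, not part of the original answer.

```python
import pandas as pd

data = [
    [1, 11111, 324, 0],
    [1, 11111, 324, 0],
    [1, 11111, 324, 0],
    [1, 22222, 490, 0],
    [1, 33333, 675, 1],
    [2, 44444, 888, 0],
]
df = pd.DataFrame(data, columns=['CustID', 'MatchID', 'LocationID', 'isMajor'])

def compute_something(list_minor=None, list_major=None):
    # Stand-in for the asker's function; it only echoes its inputs.
    return list_minor, list_major

# Count rows per (CustID, isMajor, MatchID), then gather the counts per (CustID, isMajor).
counts = df.groupby(['CustID', 'isMajor', 'MatchID']).size()
table = counts.groupby(level=[0, 1]).agg(list).unstack('isMajor')

def to_list(cell):
    # Combinations that do not occur are NaN (or absent) after unstack; map them to [].
    return cell if isinstance(cell, list) else []

results = {}
for cust_id, row in table.iterrows():
    # Sorting is only for a stable order; the question says order does not matter.
    results[cust_id] = compute_something(
        list_minor=sorted(to_list(row.get(0))),
        list_major=sorted(to_list(row.get(1))),
    )
```

Here results maps each CustID to whatever compute_something returns, so customer 2 receives an empty list for list_major rather than NaN.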

Update: try this version, which uses a single groupby:

(df.groupby(['CustID','isMajor'])['MatchID']
   .apply(lambda x: x.value_counts().agg(list))
   .unstack('isMajor')
)

Also, groupby with two keys can be slow. In that case, you can just concatenate the keys and groupby on that:

keys = df['CustID'].astype(str) + '_' + df['isMajor'].astype(str)

(df.groupby(keys)['MatchID']
   .apply(lambda x: x.value_counts().agg(list))
)
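
If the per-group lambda is the bottleneck, another variant worth trying is to do the counting with the built-in size() first and only then gather the counts into lists. This is a sketch under assumptions, not a measured fix: it counts by LocationID (as the question describes), passes sort=False to skip sorting the group keys, and the names counts, lists, and n are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'CustID': [1, 1, 1, 1, 1, 2],
        'MatchID': [11111, 11111, 11111, 22222, 33333, 44444],
        'LocationID': [324, 324, 324, 490, 675, 888],
        'isMajor': [0, 0, 0, 0, 1, 0],
    }
)

# Step 1: row count per (CustID, isMajor, LocationID) using the built-in size().
counts = (
    df.groupby(['CustID', 'isMajor', 'LocationID'], sort=False)
      .size()
      .reset_index(name='n')
)

# Step 2: gather the counts into one list per (CustID, isMajor).
lists = counts.groupby(['CustID', 'isMajor'], sort=False)['n'].agg(list)

minor_for_1 = lists.loc[(1, 0)]  # counts for customer 1, isMajor == 0
```

This avoids calling value_counts inside a Python lambda for every group; whether it is fast enough for a very large frame would need to be timed on the real data.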

edited Jan 27, 2021 at 4:45

answered Jan 27, 2021 at 2:10

9 Comments

Toan Nguyen Phuoc Over a year ago

Hi @Quang Hoang, your result is different from what I expected. With CustID = 1 I want the value for isMajor = 0 to be [3,1], but with your solution I just get 3.

Toan Nguyen Phuoc Over a year ago

I forgot that the position of each column in the groupby changes the result!

Toan Nguyen Phuoc Over a year ago

Hi @Quang Hoang, can I ask you something? I saw that you used unstack to turn an index level into columns, so I checked your output with info() and saw that the column names are 0 and 1. But when I try to get data by column name, I get a KeyError, which is strange.

Quang Hoang Over a year ago

It depends on the type of isMajor. The column names might be '0', not 0.
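
A small sketch of how that could be checked, assuming isMajor was read in as strings (for example from a CSV); the tiny frame and the names table, labels, and fixed are made up for illustration.

```python
import pandas as pd

# isMajor deliberately stored as strings to reproduce the KeyError situation.
df = pd.DataFrame(
    {
        'CustID': [1, 1, 2],
        'MatchID': [11111, 22222, 44444],
        'isMajor': ['0', '1', '0'],
    }
)
table = df.groupby(['CustID', 'isMajor'])['MatchID'].nunique().unstack('isMajor')

labels = list(table.columns)          # the actual column labels, here strings
has_int_label = 0 in table.columns    # integer 0 is not among them, so table[0] would fail

# Convert the labels to integers so that fixed[0] and fixed[1] work.
fixed = table.rename(columns=int)
minor = fixed[0]
```

Alternatively, converting the column before grouping with df['isMajor'].astype(int) would give integer labels from the start.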

Toan Nguyen Phuoc Over a year ago

I'm testing your solution on ~20 million rows and it is very slow. I have been waiting for 20 minutes and it has not completed. Do you have any other ideas?
