40

Suppose I have a pandas DataFrame with two columns:

df: Col1  Col2
      1     1
      1     2
      1     2
      1     2
      3     4
      3     4

I want to keep only the unique pairs of values (Col1, Col2) from these two columns and report their frequency:

df2: Col1  Col2  Freq
      1     1     1
      1     2     3
      3     4     2

I thought of using df['Col1', 'Col2'].value_counts(), but it works only for one column. Is there a function that handles multiple columns?

  • df.groupby(['Col1', 'Col2']).size()? Commented Jul 4, 2017 at 13:01
  • Ambiguous title: this does not find the unique values in either Col1 or Col2, but the unique combinations of values in both Col1 and Col2, i.e. the pairs from the Cartesian product that actually occur in the data. This might not be what you want, especially for columns with higher cardinality than boolean (only two values). Commented Apr 8, 2020 at 21:33

2 Answers

73

You need groupby + size + Series.reset_index:

df = df.groupby(['Col1', 'Col2']).size().reset_index(name='Freq')
print(df)
   Col1  Col2  Freq
0     1     1     1
1     1     2     3
2     3     4     2
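
For reference, here is a self-contained sketch of the same steps that rebuilds the question's data frame so it can be run as is. The intermediate names `counts` and `freq` are illustrative and not part of the original answer.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({"Col1": [1, 1, 1, 1, 3, 3],
                   "Col2": [1, 2, 2, 2, 4, 4]})

# size() counts the rows in each (Col1, Col2) group and returns a Series
# whose index is a MultiIndex built from the group keys.
counts = df.groupby(["Col1", "Col2"]).size()

# Series.reset_index(name=...) moves the index levels back into columns
# and gives the column of counts the name "Freq".
freq = counts.reset_index(name="Freq")
print(freq)
```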

10 Comments

Thanks for the amazing answer. I'm trying to understand it by running it bit by bit, and I have a couple of questions: 1. If I only need Col1 and Col2, that is, only the unique pairs of values in the first two columns, would your answer still be the best method? 2. Why does df.groupby(['Col1', 'Col2']).size() return a Series for me? Thanks again.
@BowenLiu - 1. I think it is really fast, though a numpy solution might be faster. 2. In my opinion it returns a Series by design. Unlike aggregations such as mean or sum (df.groupby(['Col1', 'Col2'])['Col3'].sum()), no additional column is needed, because the output counts the rows in each group defined by the groupby columns, Col1 and Col2; the same columns are used for both grouping and counting. For sum, it groups by Col1 and Col2 and aggregates Col3, the column(s) listed after groupby; if that list is omitted, as in df.groupby(['Col1', 'Col2']).sum(), it sums all remaining columns.
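
To illustrate the distinction drawn in the reply above, here is a minimal sketch. `Col3` and its values are made up for the example; the question's data has no third column.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 1, 1, 1, 3, 3],
                   "Col2": [1, 2, 2, 2, 4, 4],
                   "Col3": [10, 20, 30, 40, 50, 60]})  # hypothetical extra column

grouped = df.groupby(["Col1", "Col2"])

# size() needs no value column: it counts the rows in each group.
sizes = grouped.size()

# Aggregating one selected column also returns a Series.
col3_sum = grouped["Col3"].sum()

# With no column selected, sum() aggregates every non-key column
# and returns a DataFrame.
all_sum = grouped.sum()

print(type(sizes).__name__, type(col3_sum).__name__, type(all_sum).__name__)
```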
Can't believe I didn't see your reply. Reading it after using pandas for several months, it makes a lot more sense to me now. The only thing I still don't get is the reset_index(name='Freq') part. In the pandas documentation, name is not a kwarg for reset_index. How did you name a column that was not an index in the groupby result this way? Thanks.
@BowenLiu - oops, that was a bad link; it should be Series.reset_index. The name parameter works only with Series.
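
A short sketch of that point, using the question's data. The names `named` and `renamed` are illustrative. Naming the Series first and then calling reset_index with no arguments should give the same frame.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 1, 1, 1, 3, 3],
                   "Col2": [1, 2, 2, 2, 4, 4]})

# groupby(...).size() is a Series, so Series.reset_index applies here.
counts = df.groupby(["Col1", "Col2"]).size()

# name= sets the label of the column that holds the Series values.
named = counts.reset_index(name="Freq")

# Alternative spelling: name the Series, then reset the index.
renamed = counts.rename("Freq").reset_index()

print(named.equals(renamed))
```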
Thanks a lot. I realize my understanding of groupby is quite superficial, so I am trying to deduce some general rules for it. Are there any other aggregate functions, like .size(), that can generate a Series without specifying columns (in the format df.groupby(['Col1']).function())? BTW, many of your posts have proved immensely helpful to me. I wonder if you can share how you developed such a deep and systematic understanding of pandas.
15

You could try

df.groupby(['Col1', 'Col2']).size()

For a different visual output compared with jez's answer, you can extend that solution with

pd.DataFrame(df.groupby(['Col1', 'Col2']).size().rename('Freq'))

which gives

           Freq
Col1 Col2      
1    1        1
     2        3
3    4        2
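
A runnable sketch of this variant follows, using Series.to_frame as an alternative spelling of the pd.DataFrame(...) wrapper. The last lines assume pandas 1.1 or later, where DataFrame.value_counts is available and counts unique rows across several columns, which is what the question originally tried. The names `indexed` and `vc` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 1, 1, 1, 3, 3],
                   "Col2": [1, 2, 2, 2, 4, 4]})

# Keep Col1/Col2 as a MultiIndex and store the counts in a "Freq" column.
indexed = df.groupby(["Col1", "Col2"]).size().to_frame("Freq")
print(indexed)

# Assumption: pandas >= 1.1. DataFrame.value_counts counts unique rows
# over the selected columns; by default it sorts by count, descending.
vc = df[["Col1", "Col2"]].value_counts()
print(vc)
```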
