1

I have the following dataframe. I want to group by a and b first. Within each group, I need to count the values of c and keep only the value with the highest count. If more than one value of c ties for the highest count within a group, any one of them will do.

a    b    c
1    1    x
1    1    y
1    1    y
1    2    y
1    2    y
1    2    z
2    1    z
2    1    z
2    1    a
2    1    a

The expected result would be

a    b    c
1    1    y
1    2    y
2    1    z

What is the right way to do it? It would be even better if I can print out each group with c's value counts sorted as an intermediate step.

  • Why does a = 2 have two entries, z and a, for the same b = 1? Commented Apr 10, 2020 at 15:52
  • @anky there are lots of duplicates in the data, not just for a and b but for a, b and c too. Part of the reason for doing this is to remove most of the duplicates. Commented Apr 10, 2020 at 16:04
  • I get your point, but as per your question, if more than one c value has the most counts in a group, any one can be picked. The a=2, b=1 group has both z and a appearing twice, so shouldn't just one be taken in the output? Commented Apr 10, 2020 at 16:06
  • What exactly is the issue? Have you tried anything, done any research? Commented Apr 10, 2020 at 17:19

3 Answers

8

You are looking for .value_counts(), which counts c within each (a, b) group and sorts the counts in descending order within the group:

df.groupby(['a', 'b'])['c'].value_counts()
a  b  c
1  1  y    2
      x    1
   2  y    2
      z    1
2  1  a    2
      z    2
Name: c, dtype: int64
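
The counts above are the intermediate step; one possible way to reduce them to a single c per group is to keep the first row of each (a, b) group, since value_counts() already sorts each group by count. This is only a sketch: the dataframe is rebuilt from the question's sample, and the names counts, top, result and the column name n are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "b": [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
    "c": ["x", "y", "y", "y", "y", "z", "z", "z", "a", "a"],
})

# Intermediate step: counts of c per (a, b), sorted descending within each group.
counts = df.groupby(["a", "b"])["c"].value_counts()

# Keep only the first (most frequent) row of each (a, b) group.
top = counts.groupby(level=["a", "b"]).head(1)

# Turn the index levels back into columns; "n" is an arbitrary name for the count.
result = top.reset_index(name="n")[["a", "b", "c"]]
print(result)
```

For the tied group (a=2, b=1) either z or a may come out, which matches the "pick any one" requirement in the question.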

3

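Group the original dataframe by ['a', 'b'] and take the most frequent c in each group. Note that .max() alone returns the lexicographically largest value of c, not the most frequent one (for a=1, b=2 it gives z rather than y), so take the top entry of value_counts() instead:

df.groupby(['a', 'b'])['c'].agg(lambda s: s.value_counts().idxmax())

You can also return the winning value together with its count using named aggregation (the dict-renaming form of .agg() on a single column no longer works in pandas 1.0 and later):

df.groupby(['a', 'b'])['c'].agg(top=lambda s: s.value_counts().idxmax(), count=lambda s: s.value_counts().max()).reset_index()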
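
One caveat worth checking on the sample data: on a string column, .max() compares values alphabetically, so it is not the same thing as the most frequent value. The sketch below contrasts the two; most_frequent is a hypothetical helper introduced here for illustration, and it uses Series.mode(), which returns all tied values in sorted order, so taking the first one is just one way of "picking any".

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "b": [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
    "c": ["x", "y", "y", "y", "y", "z", "z", "z", "a", "a"],
})

def most_frequent(s: pd.Series) -> str:
    """Hypothetical helper: return one of the most frequent values in s."""
    return s.mode().iloc[0]

grouped = df.groupby(["a", "b"])["c"]

by_max = grouped.max()                # alphabetical maximum per group
by_mode = grouped.agg(most_frequent)  # most frequent value per group

# Side-by-side comparison; the two columns differ for a=1, b=2.
print(pd.DataFrame({"max": by_max, "most_frequent": by_mode}))
```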

1 Comment

I know this is the final result I wanted but is there a way to sort c's occurrences within a group first?
1

Try:

df = (
    df.groupby(["a", "b", "c"])["c"]
    .count()
    .sort_values(ascending=False)
    .reset_index(name="dropme")
    .drop_duplicates(subset=["a", "b"], keep="first")
    .drop("dropme", axis=1)
)

Outputs:

   a  b  c
0  2  1  z
2  1  2  y
3  1  1  y
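
If the sorted per-group counts are wanted as a visible intermediate step, as the question asks, the chain can be split up. The following is a rough sketch of that idea rather than the answer's exact method: it uses size() instead of count(), sorts by a, b and then descending count, and the names counts, n and result are assumptions made for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "b": [1, 1, 1, 2, 2, 2, 1, 1, 1, 1],
    "c": ["x", "y", "y", "y", "y", "z", "z", "z", "a", "a"],
})

# Count each (a, b, c) combination; "n" is an arbitrary column name.
counts = df.groupby(["a", "b", "c"]).size().reset_index(name="n")

# Order groups by a and b, and within each group by descending count.
counts = counts.sort_values(["a", "b", "n"], ascending=[True, True, False])

# Intermediate step: print the sorted counts of c for every (a, b) group.
for (a, b), grp in counts.groupby(["a", "b"]):
    print(f"a={a}, b={b}")
    print(grp[["c", "n"]].to_string(index=False))

# Final step: keep the first (highest-count) row of each group.
result = (
    counts.drop_duplicates(subset=["a", "b"], keep="first")[["a", "b", "c"]]
    .reset_index(drop=True)
)
print(result)
```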
