I want to find the most common value for each group. UPDATE: if a group contains both real values and NaNs, I want the NaNs dropped; I only want NaN when NaN is all the group contains.
Some of my groups have all their data missing, and in these cases I would like the result to be missing data (NaN) as the most common value.
In these cases the DataFrame.groupby.agg(pd.Series.mode) call returns an empty categorical, where what I want is NaN.
A toy example follows ...
data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""
import pandas as pd
from io import StringIO

df = (
    pd.read_csv(StringIO(data), skipinitialspace=True)
      .astype('category')
)
df.groupby('Group')['Value'].agg(pd.Series.mode)
Which yields ...
A 1.0
B 2.0
C 3.0
D [], Categories (3, float64): [1.0, 2.0, 3.0]
Name: Value, dtype: object
My question: is there a way to get NaN here, or to detect the empty categorical and convert it to NaN? UPDATED: note that I cannot use dropna=False, as that would give an incorrect answer for group C above.
By way of context, my original DataFrame has 27 million rows and my grouped frame has 6 million rows, so I want to avoid slow solutions.
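Given the size concern, one vectorized route (a sketch, not benchmarked on data this large) is to count each (Group, Value) pair once, keep the top count per group, and reindex so that all-NaN groups reappear as NaN. The `n > 0` filter matters because `value_counts` on a categorical column also reports zero counts for categories unseen in a group, and an all-NaN group must not pick up a spurious zero-count "mode":

```python
import io
import pandas as pd

data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""

df = (
    pd.read_csv(io.StringIO(data), skipinitialspace=True)
      .astype('category')
)

# Count occurrences of each (Group, Value) pair; NaN values are dropped.
counts = (
    df.groupby('Group', observed=True)['Value']
      .value_counts()
      .rename('n')             # avoid a name clash with the 'Value' index level
      .reset_index()
)

# Categorical columns report zero counts for categories unseen in a group;
# drop those rows so all-NaN groups stay absent until the reindex below.
counts = counts[counts['n'] > 0]

modes = (
    counts.sort_values('n', ascending=False)
          .drop_duplicates('Group')       # keep the most frequent value per group
          .set_index('Group')['Value']
          .reindex(df['Group'].unique())  # restore all-NaN groups as NaN
          .sort_index()
)
print(modes)
```

Note this breaks mode ties arbitrarily, whereas `pd.Series.mode` returns every tied value, so check that against your requirements before relying on it.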
df.replace('', 'NaN').groupby('Group')['Value'].agg(pd.Series.mode)?
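The replace shouldn't be needed: read_csv has already parsed the empty fields as NaN, so there are no '' strings left to substitute. A direct way to get the desired NaN is a small custom aggregator that falls back to NaN whenever Series.mode() comes back empty. A minimal sketch against the toy data (the helper name mode_or_nan is my own):

```python
import io
import numpy as np
import pandas as pd

data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""

df = (
    pd.read_csv(io.StringIO(data), skipinitialspace=True)
      .astype('category')
)

def mode_or_nan(s):
    """Most common value of s, or NaN if the group is all-NaN."""
    m = s.mode()          # mode() drops NaN by default, so an all-NaN group yields an empty result
    return m.iloc[0] if len(m) else np.nan

result = df.groupby('Group', observed=True)['Value'].agg(mode_or_nan)
print(result)
```

If there is a tie, this keeps only the first of the modes (`Series.mode` returns them sorted, so the smallest). The aggregator still runs once per group in Python, so on 6 million groups it may be slow; it is the simplest fix rather than the fastest.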