I want to find the most common value for each group. UPDATE: if a group contains both real values and NaNs, I want the NaNs dropped; I only want NaN when NaN is all the group contains.
Some of my groups have all their data missing, and in these cases I would like the result to be missing data (NaN) as the most common value.
In these cases the DataFrame.groupby.agg(pd.Series.mode) call returns an empty categorical, where what I want is NaN.
A toy example follows ...
data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""
import pandas as pd
from io import StringIO

df = (
    pd.read_csv(StringIO(data), skipinitialspace=True)
      .astype('category')
)
df.groupby('Group')['Value'].agg(pd.Series.mode)
Which yields ...
A 1.0
B 2.0
C 3.0
D [], Categories (3, float64): [1.0, 2.0, 3.0]
Name: Value, dtype: object
My question: is there a way to get NaN here, or to detect the empty categorical and convert it to NaN? UPDATED: note that I cannot use dropna=False, as that would give an incorrect answer for group C above.
By way of context, my original DataFrame has 27 million rows and my grouped frame has 6 million rows, so I want to avoid slow solutions.
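Given the size concern, one vectorized route (a sketch, not benchmarked on data this large) is to count each (Group, Value) pair once, keep the top count per group, and reindex so that all-NaN groups reappear as NaN. The `n > 0` filter matters because `value_counts` on a categorical column also reports zero counts for categories unseen in a group, and an all-NaN group must not pick up a spurious zero-count "mode":

```python
import io
import pandas as pd

data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""

df = (
    pd.read_csv(io.StringIO(data), skipinitialspace=True)
      .astype('category')
)

# Count occurrences of each (Group, Value) pair; NaN values are dropped.
counts = (
    df.groupby('Group', observed=True)['Value']
      .value_counts()
      .rename('n')             # avoid a name clash with the 'Value' index level
      .reset_index()
)

# Categorical columns report zero counts for categories unseen in a group;
# drop those rows so all-NaN groups stay absent until the reindex below.
counts = counts[counts['n'] > 0]

modes = (
    counts.sort_values('n', ascending=False)
          .drop_duplicates('Group')       # keep the most frequent value per group
          .set_index('Group')['Value']
          .reindex(df['Group'].unique())  # restore all-NaN groups as NaN
          .sort_index()
)
print(modes)
```

Note this breaks mode ties arbitrarily, whereas `pd.Series.mode` returns every tied value, so check that against your requirements before relying on it.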
df.replace('', 'NaN').groupby('Group')['Value'].agg(pd.Series.mode)?
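The replace shouldn't be needed: read_csv has already parsed the empty fields as NaN, so there are no '' strings left to substitute. A direct way to get the desired NaN is a small custom aggregator that falls back to NaN whenever Series.mode() comes back empty. A minimal sketch against the toy data (the helper name mode_or_nan is my own):

```python
import io
import numpy as np
import pandas as pd

data = """
Group, Value
A, 1
A, 1
A, 1
B, 2
C, 3
C,
C,
D,
D,
"""

df = (
    pd.read_csv(io.StringIO(data), skipinitialspace=True)
      .astype('category')
)

def mode_or_nan(s):
    """Most common value of s, or NaN if the group is all-NaN."""
    m = s.mode()          # mode() drops NaN by default, so an all-NaN group yields an empty result
    return m.iloc[0] if len(m) else np.nan

result = df.groupby('Group', observed=True)['Value'].agg(mode_or_nan)
print(result)
```

If there is a tie, this keeps only the first of the modes (`Series.mode` returns them sorted, so the smallest). The aggregator still runs once per group in Python, so on 6 million groups it may be slow; it is the simplest fix rather than the fastest.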