
I have a dataframe with a LONG list of columns, some of which may not always exist depending on the data source, time of day, etc. I need to aggregate this data with min/max/avg, pct, and some counts, but any time I do this with a dataframe that is missing a column, the entire aggregation fails with an error. Is there a way to elegantly handle missing columns, by ignoring the error if it's a missing column, or maybe by creating any columns that don't exist inline?

import numpy as np

df_aggs = df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour']).agg(
    DeviceName=('DeviceName', 'first'),
    DeviceType=('DeviceType', 'first'),

    V_LD_SEC_A_min=('V_LD_SEC_A', np.min),
    V_LD_SEC_A_avg=('V_LD_SEC_A', np.mean),
    V_LD_SEC_A_max=('V_LD_SEC_A', np.max),

    V_LD_SEC_B_min=('V_LD_SEC_B', np.min),
    V_LD_SEC_B_avg=('V_LD_SEC_B', np.mean),
    V_LD_SEC_B_max=('V_LD_SEC_B', np.max),

    # ...many more columns...

    X_DOG_A_count=('X_DOG_A', np.count_nonzero),
    X_DOG_B_count=('X_DOG_B', np.count_nonzero),
    X_DOG_C_count=('X_DOG_C', np.count_nonzero),
    X_DOG_count=('X_DOG', np.count_nonzero),
    X_NEU_LO_count=('X_NEU_LO', np.count_nonzero),

    CVR_X_ENGAGED_A_pct=('CVR_X_ENGAGED_A',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_B_pct=('CVR_X_ENGAGED_B',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_C_pct=('CVR_X_ENGAGED_C',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_3PH_pct=('CVR_X_ENGAGED_3PH',
                           lambda x: np.sum(x) / np.size(x) * 100),
).reset_index(drop=True)

If, in this example, the column 'V_LD_SEC_B' is missing from df, the entire aggregation fails. What I'd like to get back is df_aggs with the missing column(s) added, with NaN as the value. Do I have to loop through the entire dataframe, creating the columns that don't exist, or can I create them inline in some way?

  • use reindex(full_list_of_columns, axis=1)? – commented Jan 5, 2021
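A minimal sketch of what that comment suggests: reindex df to the full expected column list before aggregating, so any missing columns are created as all-NaN. The column list here is illustrative and assumes the question's df; only a few of its columns are shown.

import pandas as pd

# Hypothetical list of every column the aggregation expects; only a few of
# the question's columns are shown here.
full_list_of_columns = [
    'DeviceUUID', 'year', 'month', 'day', 'hour',
    'DeviceName', 'DeviceType',
    'V_LD_SEC_A', 'V_LD_SEC_B',  # ...and the rest
]

# Missing columns are created filled with NaN. Note this also drops any
# columns not in the list, so the list must be complete.
df = df.reindex(full_list_of_columns, axis=1)

One caveat: an all-NaN column aggregates to NaN for min/avg/max, which is what the question asks for, but np.count_nonzero counts NaN as nonzero, so the count-style aggregations would need a NaN-aware variant such as lambda x: np.count_nonzero(x.dropna()).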

1 Answer


Named aggregations allow for a variety of syntaxes. In this case it's much better to work with the dictionary format, and then unpack it to apply the aggregations to the DataFrame.

This lets us check the intersection between the columns that actually exist and the aggregations you want to apply, and then reindex to the full set of outputs at the end, regardless of whether the source column was present for the aggregation. Here's an example where the DataFrame is missing a 'val2' column that we may want to aggregate:

import pandas as pd
import numpy as np
df = pd.DataFrame({'gp': list('abbcc'),
                   'val1': [1,2,3,4,5],
                   'val3': [2,4,6,8,10]})

# Store aggregations in a dict using output col names as keys, NamedAgg as values
aggs = {'val1_max': pd.NamedAgg(column='val1', aggfunc=np.max),
        'val1_min': pd.NamedAgg(column='val1', aggfunc=np.min),
        'val2_sum': pd.NamedAgg(column='val2', aggfunc=np.sum),
        'val3_sum': pd.NamedAgg(column='val3', aggfunc=np.sum)}


# Apply only the aggregations whose source column exists (checking each
# NamedAgg's `column` field), then reindex to every output we want in the end
(df.groupby('gp')
   .agg(**{k:v for k,v in aggs.items() if v.column in df.columns})
   .reindex(aggs.keys(), axis=1)
)

    val1_max  val1_min  val2_sum  val3_sum
gp                                         
a          1         1       NaN         2
b          3         2       NaN        10
c          5         4       NaN        18
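Applied back to the question's aggregation, the same pattern might look something like the sketch below. It lists only a handful of the original columns; the full dict would enumerate every output the same way.

import pandas as pd
import numpy as np

# Sketch: a few of the question's aggregations expressed as NamedAggs.
aggs = {
    'DeviceName': pd.NamedAgg(column='DeviceName', aggfunc='first'),
    'DeviceType': pd.NamedAgg(column='DeviceType', aggfunc='first'),
    'V_LD_SEC_A_min': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.min),
    'V_LD_SEC_A_avg': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.mean),
    'V_LD_SEC_A_max': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.max),
    'V_LD_SEC_B_min': pd.NamedAgg(column='V_LD_SEC_B', aggfunc=np.min),
    # ...remaining min/avg/max, count, and pct aggregations...
    'X_DOG_A_count': pd.NamedAgg(column='X_DOG_A', aggfunc=np.count_nonzero),
    'CVR_X_ENGAGED_A_pct': pd.NamedAgg(
        column='CVR_X_ENGAGED_A',
        aggfunc=lambda x: np.sum(x) / np.size(x) * 100),
}

# Aggregate only the columns that exist, then reindex to the full output set
# so missing columns come back as NaN.
df_aggs = (
    df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour'])
      .agg(**{k: v for k, v in aggs.items() if v.column in df.columns})
      .reindex(aggs.keys(), axis=1)
      .reset_index()  # keeps the group keys as columns
)

Note the sketch uses reset_index() without drop=True so the grouping keys (DeviceUUID, year, etc.) survive as regular columns; drop=True, as in the question's code, would discard them.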