
I have a dataframe with a LONG list of columns, some of which may not always exist depending on the data source, time of day, etc. I need to aggregate this data with min/max/avg, pct, and some counts, but any time I do this with a dataframe that is missing a column, the entire aggregation fails with an error. Is there a way to elegantly handle missing columns, by ignoring the error if it's a missing column, or maybe by creating any columns that don't exist inline?

import numpy as np

df_aggs = df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour']).agg(
    DeviceName=('DeviceName', 'first'),
    DeviceType=('DeviceType', 'first'),

    V_LD_SEC_A_min=('V_LD_SEC_A', np.min),
    V_LD_SEC_A_avg=('V_LD_SEC_A', np.mean),
    V_LD_SEC_A_max=('V_LD_SEC_A', np.max),

    V_LD_SEC_B_min=('V_LD_SEC_B', np.min),
    V_LD_SEC_B_avg=('V_LD_SEC_B', np.mean),
    V_LD_SEC_B_max=('V_LD_SEC_B', np.max),

    # ...many more columns...

    X_DOG_A_count=('X_DOG_A', np.count_nonzero),
    X_DOG_B_count=('X_DOG_B', np.count_nonzero),
    X_DOG_C_count=('X_DOG_C', np.count_nonzero),
    X_DOG_count=('X_DOG', np.count_nonzero),
    X_NEU_LO_count=('X_NEU_LO', np.count_nonzero),

    CVR_X_ENGAGED_A_pct=('CVR_X_ENGAGED_A',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_B_pct=('CVR_X_ENGAGED_B',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_C_pct=('CVR_X_ENGAGED_C',
                         lambda x: np.sum(x) / np.size(x) * 100),
    CVR_X_ENGAGED_3PH_pct=('CVR_X_ENGAGED_3PH',
                           lambda x: np.sum(x) / np.size(x) * 100),
).reset_index(drop=True)

If, in this example, the column 'V_LD_SEC_B' is missing from df, the entire aggregation fails. What I'd like to get back is df_aggs with the missing column(s) added, with NaN as the value. Do I have to loop through the entire dataframe, creating the columns that don't exist, or can I create them inline in some way?

  • use reindex(full_list_of_columns, axis=1)? – commented Jan 5, 2021
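A minimal sketch of what that comment suggests: reindex df to the full expected column list before aggregating, so any missing columns are created as all-NaN. The column list here is illustrative and assumes the question's df; only a few of its columns are shown.

import pandas as pd

# Hypothetical list of every column the aggregation expects; only a few of
# the question's columns are shown here.
full_list_of_columns = [
    'DeviceUUID', 'year', 'month', 'day', 'hour',
    'DeviceName', 'DeviceType',
    'V_LD_SEC_A', 'V_LD_SEC_B',  # ...and the rest
]

# Missing columns are created filled with NaN. Note this also drops any
# columns not in the list, so the list must be complete.
df = df.reindex(full_list_of_columns, axis=1)

One caveat: an all-NaN column aggregates to NaN for min/avg/max, which is what the question asks for, but np.count_nonzero counts NaN as nonzero, so the count-style aggregations would need a NaN-aware variant such as lambda x: np.count_nonzero(x.dropna()).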

1 Answer


Named aggregations allow for a variety of syntaxes. In this case it's much better to work with the dictionary format, and then unpack it to apply the aggregations to the DataFrame.

This lets us check the intersection between the columns that actually exist and the aggregations you want to apply, and then reindex to the full set of outputs at the end, regardless of whether the source column was present for the aggregation. Here's an example where the DataFrame is missing a 'val2' column that we may want to aggregate:

import pandas as pd
import numpy as np
df = pd.DataFrame({'gp': list('abbcc'),
                   'val1': [1,2,3,4,5],
                   'val3': [2,4,6,8,10]})

# Store aggregations in a dict using output col names as keys, NamedAgg as values
aggs = {'val1_max': pd.NamedAgg(column='val1', aggfunc=np.max),
        'val1_min': pd.NamedAgg(column='val1', aggfunc=np.min),
        'val2_sum': pd.NamedAgg(column='val2', aggfunc=np.sum),
        'val3_sum': pd.NamedAgg(column='val3', aggfunc=np.sum)}


# Apply only the aggregations whose source column exists (checking each
# NamedAgg's `column` field), then reindex to every output we want in the end
(df.groupby('gp')
   .agg(**{k:v for k,v in aggs.items() if v.column in df.columns})
   .reindex(aggs.keys(), axis=1)
)

    val1_max  val1_min  val2_sum  val3_sum
gp                                         
a          1         1       NaN         2
b          3         2       NaN        10
c          5         4       NaN        18
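Applied back to the question's aggregation, the same pattern might look something like the sketch below. It lists only a handful of the original columns; the full dict would enumerate every output the same way.

import pandas as pd
import numpy as np

# Sketch: a few of the question's aggregations expressed as NamedAggs.
aggs = {
    'DeviceName': pd.NamedAgg(column='DeviceName', aggfunc='first'),
    'DeviceType': pd.NamedAgg(column='DeviceType', aggfunc='first'),
    'V_LD_SEC_A_min': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.min),
    'V_LD_SEC_A_avg': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.mean),
    'V_LD_SEC_A_max': pd.NamedAgg(column='V_LD_SEC_A', aggfunc=np.max),
    'V_LD_SEC_B_min': pd.NamedAgg(column='V_LD_SEC_B', aggfunc=np.min),
    # ...remaining min/avg/max, count, and pct aggregations...
    'X_DOG_A_count': pd.NamedAgg(column='X_DOG_A', aggfunc=np.count_nonzero),
    'CVR_X_ENGAGED_A_pct': pd.NamedAgg(
        column='CVR_X_ENGAGED_A',
        aggfunc=lambda x: np.sum(x) / np.size(x) * 100),
}

# Aggregate only the columns that exist, then reindex to the full output set
# so missing columns come back as NaN.
df_aggs = (
    df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour'])
      .agg(**{k: v for k, v in aggs.items() if v.column in df.columns})
      .reindex(aggs.keys(), axis=1)
      .reset_index()  # keeps the group keys as columns
)

Note the sketch uses reset_index() without drop=True so the grouping keys (DeviceUUID, year, etc.) survive as regular columns; drop=True, as in the question's code, would discard them.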