I'm trying to map the results of a two-level aggregation back to the original categorical feature and use them as a new feature. I created the aggregation like this:

temp_df = pd.concat([X_train[['cat1', 'cont1', 'cat2']], X_test[['cat1', 'cont1', 'cat2']]])
temp_df = temp_df.groupby(['cat1', 'cat2'])['cont1'].agg(['mean']).reset_index().rename(columns={'mean': 'cat1_cont1/cat2_Mean'})

Then I built a MultiIndex from the values of the first and second categorical features, and finally converted the new aggregation feature to a dict.

arrays = [list(temp_df['cat1']), list(temp_df['cat2'])]    
temp_df.index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['cat1', 'cat2'])
temp_df = temp_df['cat1_cont1/cat2_Mean'].to_dict()

The dict keys are tuples taken from the MultiIndex: the first element of each tuple is a cat1 value and the second is a cat2 value.

{(1000, 'C'): 23.443,
 (1001, 'H'): 50.0,
 (1001, 'W'): 69.5,
 (1002, 'H'): 60.0,
 (1003, 'W'): 42.95,
 (1004, 'H'): 51.0,
 (1004, 'R'): 150.0,
 (1004, 'W'): 226.0,
 (1005, 'H'): 50.0}

When I try to map those values to the original cat1 feature, everything becomes NaN. How can I do this properly?

X_train['cat1'].map(temp_df) # Produces a column of all NaNs

1 Answer

You can map by multiple columns, but you first need to build tuples from the original columns, here with temp_df[['cat1', 'cat2']].apply(tuple, axis=1):

import pandas as pd

temp_df = pd.DataFrame({
    'cat1': list('aaaabb'),
    'cat2': [4, 5, 4, 5, 5, 4],
    'cont1': [7, 8, 9, 4, 2, 3],
})

new = (temp_df.groupby(['cat1', 'cat2'])['cont1'].agg(['mean'])
             .reset_index()
             .rename(columns={'mean': 'cat1_cont1/cat2_Mean'}))
print (new)
  cat1  cat2  cat1_cont1/cat2_Mean
0    a     4                     8
1    a     5                     6
2    b     4                     3
3    b     5                     2

# Index by both key columns, then convert the mean column to a dict keyed by (cat1, cat2) tuples
d = new.set_index(['cat1', 'cat2'])['cat1_cont1/cat2_Mean'].to_dict()
print(d)
{('a', 4): 8, ('a', 5): 6, ('b', 4): 3, ('b', 5): 2}

temp_df['cat1_cont1/cat2_Mean'] = temp_df[['cat1', 'cat2']].apply(tuple, axis=1).map(d)
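
The all-NaN result in the question most likely comes from a key mismatch: X_train['cat1'] holds single values such as 1000, while the dict keys are (cat1, cat2) tuples, so no lookup succeeds. As a minimal sketch, the tuple keys can also be built with zip rather than a row-wise apply, which tends to be faster on large frames. The names df, keys and pair_mean are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'cat1': list('aaaabb'),
    'cat2': [4, 5, 4, 5, 5, 4],
    'cont1': [7, 8, 9, 4, 2, 3],
})

# Dict keyed by (cat1, cat2) tuples, built directly from the grouped mean.
d = df.groupby(['cat1', 'cat2'])['cont1'].mean().to_dict()

# One tuple per row, built with zip instead of a row-wise apply.
keys = pd.Series(list(zip(df['cat1'], df['cat2'])), index=df.index)

# Each tuple now matches a dict key, so the lookup succeeds.
df['pair_mean'] = keys.map(d)
```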

To add a new column filled with the aggregated values, it is simpler to use GroupBy.transform:

temp_df['cat1_cont1/cat2_Mean1'] = temp_df.groupby(['cat1', 'cat2'])['cont1'].transform('mean')
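
For the setup in the question, where the mean is computed over X_train and X_test together, transform can be applied to the concatenated frame and the result split back afterwards. The following is only a sketch under that assumption: the two small frames are hypothetical stand-ins for the real data, and `both` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-ins for the question's X_train and X_test.
X_train = pd.DataFrame({'cat1': [1000, 1000, 1001],
                        'cat2': ['C', 'C', 'H'],
                        'cont1': [10.0, 20.0, 50.0]})
X_test = pd.DataFrame({'cat1': [1000, 1001],
                       'cat2': ['C', 'H'],
                       'cont1': [30.0, 70.0]})

# Stack both frames; the outer index level records the source frame.
both = pd.concat([X_train, X_test], keys=['train', 'test'])

# Group mean over the combined rows, aligned to every original row.
both['cat1_cont1/cat2_Mean'] = (
    both.groupby(['cat1', 'cat2'])['cont1'].transform('mean')
)

# Split back; selecting on the outer level restores each original index.
X_train = both.loc['train']
X_test = both.loc['test']
```

No dict or MultiIndex lookup is needed here, because transform returns a result with the same index as its input.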

Another solution is DataFrame.join with a Series that has a MultiIndex:

s = temp_df.groupby(['cat1', 'cat2'])['cont1'].agg('mean').rename('cat1_cont1/cat2_Mean2')
temp_df = temp_df.join(s, on=['cat1', 'cat2'])

print (temp_df)
  cat1  cat2  cont1  cat1_cont1/cat2_Mean  cat1_cont1/cat2_Mean1  \
0    a     4      7                     8                      8   
1    a     5      8                     6                      6   
2    a     4      9                     8                      8   
3    a     5      4                     6                      6   
4    b     5      2                     2                      2   
5    b     4      3                     3                      3   

   cat1_cont1/cat2_Mean2  
0                      8  
1                      6  
2                      8  
3                      6  
4                      2  
5                      3  
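
Unlike transform, the join approach also works when the aggregated Series has to be attached to a different frame. A small sketch, assuming a hypothetical second frame named `other` that contains one (cat1, cat2) pair never seen during aggregation; that row simply receives NaN:

```python
import pandas as pd

temp_df = pd.DataFrame({
    'cat1': list('aaaabb'),
    'cat2': [4, 5, 4, 5, 5, 4],
    'cont1': [7, 8, 9, 4, 2, 3],
})

# Series of group means, indexed by a (cat1, cat2) MultiIndex.
s = (temp_df.groupby(['cat1', 'cat2'])['cont1']
            .mean()
            .rename('cat1_cont1/cat2_Mean'))

# Hypothetical second frame; the pair ('c', 4) never occurs in temp_df.
other = pd.DataFrame({'cat1': ['a', 'b', 'c'], 'cat2': [5, 4, 4]})

# Left join of the two key columns against the MultiIndex of s.
other = other.join(s, on=['cat1', 'cat2'])
```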
1 Comment

Thanks for your answer. transform() function was more than enough, apparently I was overcomplicating things.
