2

I have two large data sets which I can't do the aggregations by combining two dataframes. I have to do the aggregation on df_train first, then map the values to the df_test.

df_train and df_test have the same exact id1 and id2, but the df_test have more samples. I'm computing the target mean on id1 and id2 and store it as a dictionary for memory issues.

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()

This is the output of the aggregation. The keys are tuple pairs with id1 as the first element and id2 as the second element, and the values are target means of the groups.

{(0, 0): 146.45497131347656,
 (1, 0): 74.86539459228516,
 (2, 0): 14.551384925842285,
 (3, 0): 235.5499725341797,
 (4, 0): 976.5567626953125,
 (5, 0): 17.894445419311523,
 (6, 0): 64.06660461425781,
 (7, 0): 350.33416748046875,
 (7, 1): 3097.043701171875,
 (8, 0): 256.92779541015625,
 (9, 0): 72.7147445678711 }

How can I map those values to id1 and id2 columns properly?

(There are 60 million rows of data, 1449 id1 and 4 id2 values, so speed is important)

EDIT:

df_train[['id1', 'id2']].map(target_mean)

I tried this, but map is only supported by pd.Series.

1 Answer 1

3

I think better is use DataFrame.join here:

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().rename('avg')

df_test = df_test.join(target_mean, on=['id1', 'id2'])

Your solution is possible, but I guess slowier with map by MultiIndex:

target_mean = df_train.groupby(['id1', 'id2'])['target'].mean().to_dict()
df_test['avg'] = df_test.set_index(['id1', 'id2']).index.map(target_mean)
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks, join works like magic. What is difference between merge and join in this case? Should I add how='left' to join?
@gunesevitan - how='left' is not necessary, because default parameter in join. Difference between merge and join is mainly join by default use index for merging, merge by default by columns names specified in on parameter

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.