2

I have a sample DF:

sample_df = pd.DataFrame(np.random.randint(1,20,size=(6, 2)), columns=list('AB'))
sample_df["A_cat"] = ["ind","sa","sa","sa","ind","ind"]
sample_df["B_cat"] = ["sa","ind","ind","sa","sa","sa"]
sample_df

OP:

    A   B   A_cat   B_cat
0   12  8   ind     sa
1   12  11  sa      ind
2   7   19  sa      ind
3   5   11  sa      sa
4   11  7   ind     sa
5   6   18  ind     sa

I have another sample DF 2 for which I am trying to replace the column values based on a condition:

sample_df2 = pd.DataFrame()
sample_df2["A_cat"] = ["sa","ind","ind","sa","sa","ind"]
sample_df2["B_cat"] = ["ind","sa","sa","ind","sa","sa"]
sample_df2

OP:

    A_cat   B_cat
0   sa     ind
1   ind    sa
2   ind    sa
3   sa     ind
4   sa     sa
5   ind    sa

Condition:

The value in sample_df2 should be replaced by taking a groupby mean of that value in sample_df.

For example, sample_df2(0,A_cat) = sa which should be replaced by sample_df.groupby(["A_cat"])["A"].mean() for group value sa

sample OP of column A_cat in sample_df2 after conversion will be:

sample_df2["A_cat"] = [8.0000,9.666667,9.666667,8.000,8.000,9.666667]

I have done the long for loop solution for this, any suggestions for pandas approach would be great!

4
  • added sample expected output. Commented Feb 25, 2020 at 13:13
  • Are you sure the sample output is correct, and the values are not switched (i.e. 9.66 where 8 should be and vice versa)? Commented Feb 25, 2020 at 13:17
  • @ItamarMushkin Oh yea!! my bad. you are right. Commented Feb 25, 2020 at 13:17
  • I changed it now Commented Feb 25, 2020 at 13:19

2 Answers 2

1

Please check the following code and compare the elapsed time

import time

start_time = time.time()
categorical_cols = [col for col in df1.columns if len(col.split('_')) > 1]
numerical_cols = df1[df1.columns.difference(categorical_cols)].columns
unique_groups = df1['A_cat'].unique().tolist() # returns [ind, sa]

for cat, num in zip(categorical_cols, numerical_cols):
    for group in unique_groups:
        df2.loc[df2[cat] == group, cat] = df1.groupby(cat)[num].mean().loc[group]

print("Elapsed time:", time.time() - start)
Sign up to request clarification or add additional context in comments.

Comments

0

This is a bit clunky, but it works:

The actual calculations are done once:

category_value_means = sample_df2.apply(
    lambda column: sample_df.groupby(column.name)[column.name.split('_')[0]].mean())

whose output is:

    A_cat       B_cat
ind 9.666667    15
sa  8.000000    11

Afterwards, all is needed is some "lookup":

sample_df2.apply(lambda column: column.apply(lambda value: category_value_means[column.name][value]))

the output is:

    A_cat   B_cat
0   8.000000    15
1   9.666667    11
2   9.666667    11
3   8.000000    15
4   8.000000    11
5   9.666667    11

2 Comments

This is great! But its taking a long time for a huge DF any suggestions on how to speed it up?
Actually, there is. I just realized I have some major redundancy there. I've edited it accordingly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.