Replace pandas column values based on another DF

Question

I have a sample DF:

sample_df = pd.DataFrame(np.random.randint(1,20,size=(6, 2)), columns=list('AB'))
sample_df["A_cat"] = ["ind","sa","sa","sa","ind","ind"]
sample_df["B_cat"] = ["sa","ind","ind","sa","sa","sa"]
sample_df

OP:

    A   B   A_cat   B_cat
0   12  8   ind     sa
1   12  11  sa      ind
2   7   19  sa      ind
3   5   11  sa      sa
4   11  7   ind     sa
5   6   18  ind     sa

I have another sample DF 2 for which I am trying to replace the column values based on a condition:

sample_df2 = pd.DataFrame()
sample_df2["A_cat"] = ["sa","ind","ind","sa","sa","ind"]
sample_df2["B_cat"] = ["ind","sa","sa","ind","sa","sa"]
sample_df2

OP:

    A_cat   B_cat
0   sa     ind
1   ind    sa
2   ind    sa
3   sa     ind
4   sa     sa
5   ind    sa

Condition:

The value in sample_df2 should be replaced by taking a groupby mean of that value in sample_df.

For example, sample_df2(0,A_cat) = sa which should be replaced by sample_df.groupby(["A_cat"])["A"].mean() for group value sa

sample OP of column A_cat in sample_df2 after conversion will be:

sample_df2["A_cat"] = [8.0000,9.666667,9.666667,8.000,8.000,9.666667]

I have done the long for loop solution for this, any suggestions for pandas approach would be great!

Are you sure the sample output is correct, and the values are not switched (i.e. 9.66 where 8 should be and vice versa)? — Itamar Mushkin
– Itamar Mushkin, Commented Feb 25, 2020 at 13:17

isabella · Accepted Answer · 2020-02-25 19:47:48Z

1

Please check the following code and compare the elapsed time

import time

start_time = time.time()
categorical_cols = [col for col in df1.columns if len(col.split('_')) > 1]
numerical_cols = df1[df1.columns.difference(categorical_cols)].columns
unique_groups = df1['A_cat'].unique().tolist() # returns [ind, sa]

for cat, num in zip(categorical_cols, numerical_cols):
    for group in unique_groups:
        df2.loc[df2[cat] == group, cat] = df1.groupby(cat)[num].mean().loc[group]

print("Elapsed time:", time.time() - start)

answered Feb 25, 2020 at 19:47

isabella

4773 silver badges11 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Itamar Mushkin · Accepted Answer · 2020-02-27 06:58:53Z

0

This is a bit clunky, but it works:

The actual calculations are done once:

category_value_means = sample_df2.apply(
    lambda column: sample_df.groupby(column.name)[column.name.split('_')[0]].mean())

whose output is:

    A_cat       B_cat
ind 9.666667    15
sa  8.000000    11

Afterwards, all is needed is some "lookup":

sample_df2.apply(lambda column: column.apply(lambda value: category_value_means[column.name][value]))

the output is:

    A_cat   B_cat
0   8.000000    15
1   9.666667    11
2   9.666667    11
3   8.000000    15
4   8.000000    11
5   9.666667    11

edited Feb 27, 2020 at 6:58

answered Feb 25, 2020 at 13:18

Itamar Mushkin

2,9232 gold badges19 silver badges34 bronze badges

2 Comments

data_person Over a year ago

This is great! But its taking a long time for a huge DF any suggestions on how to speed it up?

Itamar Mushkin Over a year ago

Actually, there is. I just realized I have some major redundancy there. I've edited it accordingly.

Collectives™ on Stack Overflow

Replace pandas column values based on another DF

2 Answers 2

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related