
I would like help vectorizing my current code; any help or comments are appreciated. I have a df with a weird column that is derived from an availability-checker function, like this:

import pandas as pd

original_df = pd.DataFrame({
    'a':['a1', 'a2', 'a3', 'a4'],
    'b':['b1', 'b20', 'b98', 'b4'],
    'c':[{'a':'not_available', 'b': 'b1'}, {}, {'a':'a3', 'b': 'b98'}, {'a':'not_available', 'b': 'not_available'}],
})
a b c
a1 b1 {'a': 'not_available', 'b': 'b1'}
a2 b20 {}
a3 b98 {'a': 'a3', 'b': 'b98'}
a4 b4 {'a': 'not_available', 'b': 'not_available'}

I would like to transform columns a and b based on the dictionary in column c, so the resulting df looks something like this:

desired_df = pd.DataFrame({
    'a':['not_available', 'a2', 'a3', 'not_available'],
    'b':['b1', 'b20', 'b98', 'not_available']})
a b
not_available b1
a2 b20
a3 b98
not_available not_available

Some things to note: if the dict in column c is empty, leave the values in the other columns as they are. The values in the dict in c can only be the current value of the corresponding column or not_available.

for idx, row in original_df.iterrows():
    for key, value in row.c.items():
        original_df.loc[idx, key] = value

This is a downsampled scenario; in reality the dict covers 8 columns and the df usually has 20-60 rows.

The loop above is my current code and it works, but it is very slow. It is used in an API, and my profiler tells me this function consumes the highest cumulative time, which makes sense since I'm iterating over everything, so I was hoping to get some help!
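For reference, a rough timing harness along these lines can reproduce the numbers below (a sketch; run_availability_fixup is just an illustrative wrapper around the loop above):

import timeit

def run_availability_fixup(df):
    # Same row-by-row overwrite as above, on a copy so repeated runs start clean.
    df = df.copy()
    for idx, row in df.iterrows():
        for key, value in row.c.items():
            df.loc[idx, key] = value
    return df

print(timeit.timeit(lambda: run_availability_fixup(original_df), number=100))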

Shubam's answer has taken this function from 20 seconds to 0.208 seconds. Thank you!

2 Answers


Create an intermediate dataframe from the dicts in column c, then mask out the values that are not equal to not_available, and use fillna to fill the resulting nulls from the original df:

# Expand the dicts in column c into their own dataframe, aligned on the original index
out = pd.DataFrame([*original_df['c']], original_df.index)
# Keep only the 'not_available' entries, then fill the rest from the original columns
out = out[out == 'not_available'].fillna(original_df)

print(out)
               a              b
0  not_available             b1
1             a2            b20
2             a3            b98
3  not_available  not_available

6 Comments

Thank you @Shubam Sharma! I realize this was not provided in the question: what if I had more columns in original_df that I wanted to preserve, something like id? Should I just do out['id'] = original_df['id']?
In that case, we can use combine_first instead of fillna.
You are the best, that would definitely be better!
I'm about to run my profiler against your code, I'll keep you updated
I went from 20 seconds to 0.208 seconds. Thank you!
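
A minimal sketch of the combine_first variant mentioned in the comments, assuming a hypothetical extra id column that should pass through untouched (the id values here are made up for illustration):

df = original_df.assign(id=['x1', 'x2', 'x3', 'x4'])  # hypothetical extra column

out = pd.DataFrame([*df['c']], df.index)
# combine_first fills the masked values and also carries over columns
# (like id) that never appeared in the dicts of column c
out = out[out == 'not_available'].combine_first(df.drop(columns='c'))
print(out)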
# Build a per-row dict of the current a/b values
dict_series = original_df[['a', 'b']].apply(lambda x: dict(x), axis=1)
s = original_df['c']
# Keep the dict from c where it is non-empty, otherwise fall back to the
# current values, then expand each dict into columns
out = s.where(s.astype(bool), dict_series).apply(lambda x: pd.Series(x))

out

    a               b
0   not_available   b1
1   a2              b20
2   a3              b98
3   not_available   not_available
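
Either answer can be sanity-checked against the desired_df from the question (assuming both keep the default integer index):

pd.testing.assert_frame_equal(out, desired_df)  # raises if the frames differ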

