
I would like help vectorizing my current code; any help or comments are appreciated. I have a df with a weird column that is derived from an availability-checker function, like this:

import pandas as pd

original_df = pd.DataFrame({
    'a':['a1', 'a2', 'a3', 'a4'],
    'b':['b1', 'b20', 'b98', 'b4'],
    'c':[{'a':'not_available', 'b': 'b1'}, {}, {'a':'a3', 'b': 'b98'}, {'a':'not_available', 'b': 'not_available'}],
})
a b c
a1 b1 {'a': 'not_available', 'b': 'b1'}
a2 b20 {}
a3 b98 {'a': 'a3', 'b': 'b98'}
a4 b4 {'a': 'not_available', 'b': 'not_available'}

I would like to transform columns a and b based on the dictionary in column c, so the resulting df looks something like this:

desired_df = pd.DataFrame({
    'a':['not_available', 'a2', 'a3', 'not_available'],
    'b':['b1', 'b20', 'b98', 'not_available']})
a b
not_available b1
a2 b20
a3 b98
not_available not_available

Some things to note: if the dict in column c is empty, leave the values in the other columns as they are. The values in the dict in c can only be the current value of the corresponding column or not_available.

for idx, row in original_df.iterrows():
    for key, value in row.c.items():
        original_df.loc[idx, key] = value

This is a downsampled scenario; in reality the dict covers 8 columns and the df usually has 20-60 rows.

The loop above is my current code and it works, but it is very slow. It is used in an API, and my profiler tells me this function consumes the highest cumulative time, which makes sense since I'm iterating over everything, so I was hoping to get some help!
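For reference, a rough timing harness along these lines can reproduce the numbers below (a sketch; run_availability_fixup is just an illustrative wrapper around the loop above):

import timeit

def run_availability_fixup(df):
    # Same row-by-row overwrite as above, on a copy so repeated runs start clean.
    df = df.copy()
    for idx, row in df.iterrows():
        for key, value in row.c.items():
            df.loc[idx, key] = value
    return df

print(timeit.timeit(lambda: run_availability_fixup(original_df), number=100))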

Shubam's answer has taken this function from 20 seconds to 0.208 seconds. Thank you!

2 Answers


Create an intermediate dataframe from the dicts in column c, then mask out the values that are not equal to not_available, and use fillna to fill the resulting nulls from the original df:

# Expand the dicts in column c into their own dataframe, aligned on the original index
out = pd.DataFrame([*original_df['c']], original_df.index)
# Keep only the 'not_available' entries, then fill the rest from the original columns
out = out[out == 'not_available'].fillna(original_df)

print(out)
               a              b
0  not_available             b1
1             a2            b20
2             a3            b98
3  not_available  not_available

6 Comments

Thank you @Shubam Sharma! I realize this was not provided in the question: what if I had more columns in original_df that I wanted to preserve, something like id? Should I just do out['id'] = original_df['id']?
In that case, we can use combine_first instead of fillna.
You are the best, that would definitely be better!
I'm about to run my profiler against your code, I'll keep you updated
I went from 20 seconds to 0.208 seconds. Thank you!
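
A minimal sketch of the combine_first variant mentioned in the comments, assuming a hypothetical extra id column that should pass through untouched (the id values here are made up for illustration):

df = original_df.assign(id=['x1', 'x2', 'x3', 'x4'])  # hypothetical extra column

out = pd.DataFrame([*df['c']], df.index)
# combine_first fills the masked values and also carries over columns
# (like id) that never appeared in the dicts of column c
out = out[out == 'not_available'].combine_first(df.drop(columns='c'))
print(out)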
# Build a per-row dict of the current a/b values
dict_series = original_df[['a', 'b']].apply(lambda x: dict(x), axis=1)
s = original_df['c']
# Keep the dict from c where it is non-empty, otherwise fall back to the
# current values, then expand each dict into columns
out = s.where(s.astype(bool), dict_series).apply(lambda x: pd.Series(x))

out

    a               b
0   not_available   b1
1   a2              b20
2   a3              b98
3   not_available   not_available
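
Either answer can be sanity-checked against the desired_df from the question (assuming both keep the default integer index):

pd.testing.assert_frame_equal(out, desired_df)  # raises if the frames differ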

