Pandas change values in column based on values in other column

Question

I have a dataframe in which one column holds some data and the other column holds the indices of the elements I want to delete from that data. So starting from this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>> data       to_delete
     [1,2,3,4]    [2]
     [0,1,2]     [0,2]

This is what I want to end up with:

new_df
>>>>   data     to_delete
     [1,2,4]       [2]
       [1]        [0,2]

I could iterate over the rows by hand and calculate the new data for each one like this:

new_data = []
for _,v in df.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)
new_df = df.assign(data=new_data)

but I'm looking for a better way to do this.

I do have numpy arrays. But would the methods differ that much if I had lists?
– emilaz, Commented Apr 7, 2020 at 21:16

yatu · Accepted Answer · 2020-04-07 22:23:21Z

2

The overhead of calling a numpy function for each row will hurt performance badly here. I'd suggest going with a list comprehension instead:

df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
              for i in df.values]

print(df)

       data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
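The same idea can be sketched as a self-contained script. This is only an illustration under the question's column names: the helper `drop_positions` is a hypothetical name, not part of pandas or numpy. It pairs the two columns with `zip` instead of going through `df.values`, and it turns each `to_delete` array into a set so the membership check does not scan the array for every element.

```python
import numpy as np
import pandas as pd


def drop_positions(values, positions):
    """Return the items of `values` whose position is not listed in `positions`."""
    # A set of plain Python ints gives constant-time membership checks.
    skip = {int(p) for p in positions}
    return [v for ix, v in enumerate(values) if ix not in skip]


df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})

# Pair the two columns row by row; no numpy function is called per row.
df['data'] = [drop_positions(d, t) for d, t in zip(df['data'], df['to_delete'])]
print(df)
```

Unlike `np.delete`, this sketch does not interpret negative indices as counting from the end, so it only matches when `to_delete` holds non-negative positions.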

Timings on a 20K row dataframe:

df_large = pd.concat([df]*10000, axis=0)

%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
         for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
new_data = []
for _,v in df_large.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)

# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_large.apply(lambda row: np.delete(row["data"], 
                       row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
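The `%timeit` lines above are IPython magics. Outside a notebook, a rough comparison can be sketched with the standard `timeit` module. The function names `with_lists` and `with_np_delete` are hypothetical, the frame is smaller than the 20K rows used above so the script stays quick, and the absolute numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})
# 2,000 rows: smaller than the 20K-row frame above, only to keep the run short.
df_large = pd.concat([df] * 1000, ignore_index=True)


def with_lists(frame):
    # Pure-Python filtering, as in the list comprehension above.
    return [[j for ix, j in enumerate(d) if ix not in t]
            for d, t in zip(frame['data'], frame['to_delete'])]


def with_np_delete(frame):
    # One np.delete call per row, as in the loop from the question.
    return [np.delete(row['data'], row['to_delete'])
            for _, row in frame.iterrows()]


t_lists = timeit.timeit(lambda: with_lists(df_large), number=3)
t_delete = timeit.timeit(lambda: with_np_delete(df_large), number=3)
print(f"list comprehension: {t_lists:.3f} s for 3 runs")
print(f"np.delete per row:  {t_delete:.3f} s for 3 runs")
```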

edited Apr 7, 2020 at 22:23

answered Apr 7, 2020 at 21:20


5 Comments

emilaz Over a year ago

this is pretty great insight, thank you. The numpy arrays are coming from further down in my actual pipeline and are used further up as well. Not sure how conversions would affect execution time.

yatu Over a year ago

Actually, for this approach, dealing with lists or arrays doesn't really change anything. As mentioned though, calling a numpy function on each row is a really bad idea and will result in poor performance. Check the timings on a not-so-large 20K row df @emilaz

emilaz Over a year ago

It does for my output, or rather the data type of my new data column values. But that is easily fixed by using [np.array([j for ix, j in enumerate(i[0]) if ix not in i[1]]) for i in df_large.values]. I will use your approach, thanks again!
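The variant described in this comment might look like the sketch below. Passing `dtype=d.dtype` is an addition that is not in the comment: it is there because `np.array([])` defaults to float64, so a row that ends up empty would otherwise change dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})

# Filter in pure Python, then wrap each result back into an array
# so the column keeps holding numpy arrays rather than lists.
df['data'] = [
    np.array([j for ix, j in enumerate(d) if ix not in t], dtype=d.dtype)
    for d, t in zip(df['data'], df['to_delete'])
]
print(df)
```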

yatu Over a year ago

Yes, exactly in that case just construct an array from the list, hope this helps @emilaz

yatu Over a year ago

Also, I'll add that having a dataframe of np.arrays is not a good idea at all, performance-wise. If you can, stick to either lists or dataframes. Numpy arrays are not useful either when they are not homogeneous (a different number of items per row) @emilaz

Daniel Geffen · Answer · 2020-04-07 21:19:27Z

1

You should use the apply function in order to apply a function to every row in the dataframe:

df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)

edited Apr 7, 2020 at 21:19

answered Apr 7, 2020 at 21:15


1 Comment

emilaz Over a year ago

That is throwing a KeyError for me: KeyError: 'data'
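One way to get exactly this error is to call `apply` without `axis=1`. That this is what happened here is an assumption, since the answer was edited shortly after it was posted. With the default `axis=0`, the lambda receives each column as a Series indexed by row labels, so looking up `"data"` fails. A minimal sketch of both calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})

# Default axis=0: the lambda gets a whole column, whose index is 0, 1, ...
failed = False
try:
    df.apply(lambda row: np.delete(row["data"], row["to_delete"]))
except KeyError as exc:
    failed = True
    print("column-wise apply raised KeyError:", exc)

# axis=1: the lambda gets one row, whose index is the column names.
fixed = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
print(fixed)
```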

swiss_knight · Answer · 2020-04-07 21:48:48Z

0

Another solution, based on starmap:

This solution uses starmap, a lesser-known tool from the itertools module.

Check its docs, it's worth a try!

import pandas as pd
import numpy as np
from itertools import starmap

df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
                   'to_delete': [np.array([2]),np.array([0,2])]})

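# Solution: starmap unpacks each (data, to_delete) pair into np.delete(data, to_delete)
df2 = df.copy()
A = list(starmap(np.delete, zip(df['data'], df['to_delete'])))

# Assign the list of arrays directly; no intermediate DataFrame is needed
df2['data'] = A
df2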

prints out:

        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
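For readers who have not used it, `starmap(f, pairs)` calls `f(*pair)` for every item in `pairs`, so it is interchangeable with a comprehension that unpacks each tuple. A small sketch on the same data, without a dataframe:

```python
from itertools import starmap

import numpy as np

pairs = [
    (np.arange(1, 5), np.array([2])),
    (np.arange(3), np.array([0, 2])),
]

# starmap unpacks each tuple into positional arguments of np.delete.
via_starmap = [a.tolist() for a in starmap(np.delete, pairs)]

# The equivalent comprehension with explicit unpacking.
via_comprehension = [np.delete(v, l).tolist() for v, l in pairs]

print(via_starmap)  # [[1, 2, 4], [1]]
```

Note that this still calls `np.delete` once per row, so it does not avoid the per-row overhead discussed in the first answer.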

edited Apr 7, 2020 at 21:48

answered Apr 7, 2020 at 21:43

