3

I have a dataframe in which one column holds some data and the other column holds the indices I want to delete from that data. So starting from this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>> data       to_delete
     [1,2,3,4]    [2]
     [0,1,2]     [0,2]

This is what I want to end up with:

new_df
>>>>   data     to_delete
     [1,2,4]       [2]
       [1]        [0,2]

I could iterate over the rows by hand and calculate the new data for each one like this:

new_data = []
for _,v in df.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)

but I'm looking for a better way to do this.

  • Actually, your iterative solution is the best I can think of. Commented Apr 7, 2020 at 21:10
  • Do you really have numpy arrays? Or lists? Commented Apr 7, 2020 at 21:15
  • I do have numpy arrays. But would the methods differ that much if I had lists? Commented Apr 7, 2020 at 21:16
  • iterrows() is rather sluggish. Why not use apply() ? Commented Apr 7, 2020 at 21:24

3 Answers

2

The overhead of calling a NumPy function for each row really hurts performance here. I'd suggest going with a list comprehension instead:

df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
              for i in df.values]

print(df)

       data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]

Timings on a 20K row dataframe:

df_large = pd.concat([df]*10000, axis=0)

%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
         for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
new_data = []
for _,v in df_large.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)

# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_large.apply(lambda row: np.delete(row["data"], 
                       row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
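
If the rest of the pipeline needs NumPy arrays rather than lists, the comprehension above can be adapted to convert each result back to an array. The following is only a minimal sketch under some assumptions: `drop_positions` is a hypothetical helper name, the indices are assumed to be non-negative (unlike `np.delete`, negative indices are not handled), and the set is just one way to keep the membership test cheap when `to_delete` is long.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})

def drop_positions(values, positions):
    """Hypothetical helper: return values without the items at the given
    non-negative positions, as an array of the same dtype."""
    skip = set(positions.tolist())  # constant-time membership test
    kept = [v for ix, v in enumerate(values) if ix not in skip]
    return np.array(kept, dtype=values.dtype)

# Zipping the two columns avoids building a Series object for every row
new_data = [drop_positions(d, t) for d, t in zip(df['data'], df['to_delete'])]
result = df.assign(data=new_data)
```

Whether this beats the plain list version depends on the data, so it is worth timing it on your own dataframe.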

5 Comments

this is pretty great insight, thank you. The numpy arrays are coming from further down in my actual pipeline and are used further up as well. Not sure how conversions would affect execution time.
Actually, for this approach it doesn't really matter whether you are dealing with lists or arrays. As mentioned, though, calling a NumPy function on each row is a really bad idea and will result in poor performance. Check the timings on a not-so-large 20K-row df @emilaz
It does for my output, or rather the data type of my new data column values. But that is easily fixed by using [np.array([j for ix, j in enumerate(i[0]) if ix not in i[1]]) for i in df_large.values]. I will use your approach, thanks again!
Yes, exactly. In that case just construct an array from the list. Hope this helps @emilaz
Also, I'll add that having a dataframe of np.arrays is not a good idea at all, performance-wise. If you can, stick to either lists or dataframes. NumPy arrays are not useful either when they are not homogeneous (a different number of items per row) @emilaz
1

You should use the apply function in order to apply a function to every row in the dataframe:

df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)

1 Comment

That is throwing a KeyError for me: KeyError: 'data'
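
The commenter's exact code isn't shown, so this is only a guess at the cause: one way to get `KeyError: 'data'` from this pattern is to leave out `axis=1`. With the default `axis=0`, `apply` passes each column to the lambda, and a column is indexed by row labels rather than by column names. A small sketch on the question's dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [np.arange(1, 5), np.arange(3)],
    'to_delete': [np.array([2]), np.array([0, 2])],
})

# axis=1: the lambda receives one row at a time, indexed by column name
by_row = df.apply(lambda row: np.delete(row['data'], row['to_delete']), axis=1)

# Default axis=0: the lambda receives whole columns, indexed by row label,
# so looking up 'data' fails
try:
    df.apply(lambda row: np.delete(row['data'], row['to_delete']))
    raised = False
except KeyError:
    raised = True
```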
0

Another solution, based on starmap:

This solution uses a lesser-known tool from the itertools module called starmap.

Check its documentation, it's worth a try!
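
For readers who haven't used it, here is a tiny illustration of what starmap does, separate from the dataframe problem: it calls a function once per tuple, unpacking the tuple into positional arguments. The `pairs` list is made up for the example.

```python
from itertools import starmap

pairs = [(2, 3), (4, 5)]

# starmap(pow, pairs) calls pow(2, 3), then pow(4, 5)
powers = list(starmap(pow, pairs))

# The same thing written as a comprehension with explicit unpacking
same = [pow(*pair) for pair in pairs]
```

Both `powers` and `same` come out as `[8, 1024]`.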

import pandas as pd
import numpy as np
from itertools import starmap

df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
                   'to_delete': [np.array([2]),np.array([0,2])]})

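# Solution: starmap unpacks each (data, to_delete) pair into np.delete
df2 = df.copy()
A = list(starmap(np.delete, zip(df['data'], df['to_delete'])))

# Assign the arrays as a Series aligned on df2's index;
# no intermediate DataFrame is needed
df2['data'] = pd.Series(A, index=df2.index)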
df2

prints out:

        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
