
I have a Pandas dataframe where the values are lists:

import pandas as pd

DF = pd.DataFrame({'X':[[1, 5], [1, 2]], 'Y':[[1, 2, 5], [1, 3, 5]]})
DF
         X          Y
0   [1, 5]  [1, 2, 5]
1   [1, 2]  [1, 3, 5]

I want to check whether each list in X is a subset of the corresponding list in Y. For individual lists we can do this with set(x).issubset(set(y)), but how would we do this across Pandas DataFrame columns?
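
For a single pair of lists, that check looks like this:

x = [1, 5]
y = [1, 2, 5]
set(x).issubset(set(y))  # True, since every element of x also appears in y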

So far, the only thing I've come up with is a workaround that loops over the individual lists and then converts the result back to a DataFrame. It seems a bit complicated for this task:

foo = [set(DF['X'][i]).issubset(set(DF['Y'][i])) for i in range(len(DF['X']))]

foo = pd.DataFrame(foo)
foo.columns = ['x_sub_y']
pd.merge(DF, foo, how = 'inner', left_index = True, right_index = True)

         X          Y   x_sub_y
0   [1, 5]  [1, 2, 5]   True
1   [1, 2]  [1, 3, 5]   False

Is there an easier way to achieve this? Possibly using .map or .apply?


3 Answers


Option 1
set conversion and difference using np.where

import numpy as np

# An empty element-wise set difference (X - Y) means X is a subset of Y
df_temp = DF.applymap(set)
DF['x_sub_y'] = np.where(df_temp.X - df_temp.Y, False, True)
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False
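
Why this works: the element-wise set difference X - Y is empty exactly when X is a subset of Y, and np.where treats an empty set as falsy. A quick look at the intermediate result (a sketch, reusing the DF from the question):

import numpy as np
import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

df_temp = DF.applymap(set)
diff = df_temp.X - df_temp.Y   # element-wise set difference
# row 0: empty set, so X is a subset of Y
# row 1: {2}, since 2 is in X but not in Y, so X is not a subset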

Option 2
Faster, astype conversion

DF['x_sub_y'] = ~(DF.X.apply(set) - DF.Y.apply(set)).astype(bool)
DF 
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Option 3
Fun with np.vectorize

def foo(x):
    return not x

v = np.vectorize(foo)
DF['x_sub_y'] = v(DF.X.apply(set) - DF.Y.apply(set))
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Extending Scott Boston's answer for speed using the same approach:

def foo(x, y):
    return set(x).issubset(y)

v = np.vectorize(foo)

DF['x_sub_y'] = v(DF.X, DF.Y)
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Small (the two-row DF above)

1000 loops, best of 3: 460 µs per loop           # Before       
10000 loops, best of 3: 103 µs per loop          # After

Large (df * 10000)

1 loop, best of 3: 1.26 s per loop               # Before   
100 loops, best of 3: 13.3 ms per loop           # After
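
A sketch of one way to reproduce these timings, assuming the "Large" frame is simply DF repeated 10,000 times and "Before"/"After" refer to the axis=1 apply versus the vectorized call:

import numpy as np
import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})
big = pd.concat([DF] * 10000, ignore_index=True)   # assumed "Large" frame

v = np.vectorize(lambda x, y: set(x).issubset(y))

# In IPython / Jupyter:
# %timeit big.apply(lambda row: set(row.X).issubset(set(row.Y)), axis=1)   # Before
# %timeit v(big.X, big.Y)                                                  # After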

4 Comments

does it make sense to "timeit" against a two-row DF?
@MaxU If the difference is so vast for a couple of rows, then there may be no need to time for a larger one... but I can do it :)
a timing for a larger data set might (not necessarily) show quite different results...
@MaxU Thanks for asking... appreciating np.vectorise even more now :-) (why haven't I used it before? x))

Use set and issubset:

DF.assign(x_sub_y = DF.apply(lambda x: set(x.X).issubset(set(x.Y)), axis=1))

Output:

        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

2 Comments

If I vectorise your answer, I get a 4x speedup on this tiny dataset.
Btw, you don't need to convert x.Y to a set; issubset accepts any iterable (see the sketch below).
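
To illustrate the comment above, a minimal sketch of the same one-liner without the inner set() conversion:

import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

# x.Y can stay a list; set.issubset accepts any iterable
DF.assign(x_sub_y=DF.apply(lambda x: set(x.X).issubset(x.Y), axis=1))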

Or you can try set

DF['x_sub_y'] = DF.X + DF.Y
DF['x_sub_y'] = DF['x_sub_y'].apply(lambda x: list(set(x))) == DF.Y
DF
Out[691]: 
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False
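
The idea here: if X is a subset of Y, concatenating X and Y and deduplicating gives back exactly the elements of Y. Note that the list comparison above relies on list(set(...)) coming out in the same order as Y, which happens to hold for this data; comparing sets instead avoids that assumption. A sketch of the order-independent variant:

import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

# Compare the union of X and Y with Y, as sets rather than lists
DF['x_sub_y'] = (DF.X + DF.Y).apply(set) == DF.Y.apply(set)
DF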

