
I have a Pandas dataframe where the values are lists:

import pandas as pd

DF = pd.DataFrame({'X':[[1, 5], [1, 2]], 'Y':[[1, 2, 5], [1, 3, 5]]})
DF
         X          Y
0   [1, 5]  [1, 2, 5]
1   [1, 2]  [1, 3, 5]

I want to check whether each list in X is a subset of the corresponding list in Y. For individual lists we can do this with set(x).issubset(set(y)), but how would we do this across Pandas DataFrame columns?
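
For a single pair of lists, that check looks like this:

x = [1, 5]
y = [1, 2, 5]
set(x).issubset(set(y))  # True, since every element of x also appears in y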

So far, the only thing I've come up with is a workaround that loops over the individual lists and then converts the result back to a DataFrame. It seems a bit complicated for this task:

foo = [set(DF['X'][i]).issubset(set(DF['Y'][i])) for i in range(len(DF['X']))]

foo = pd.DataFrame(foo)
foo.columns = ['x_sub_y']
pd.merge(DF, foo, how = 'inner', left_index = True, right_index = True)

         X          Y   x_sub_y
0   [1, 5]  [1, 2, 5]   True
1   [1, 2]  [1, 3, 5]   False

Is there an easier way to achieve this? Possibly using .map or .apply?


3 Answers


Option 1
set conversion and difference using np.where

import numpy as np

# An empty element-wise set difference (X - Y) means X is a subset of Y
df_temp = DF.applymap(set)
DF['x_sub_y'] = np.where(df_temp.X - df_temp.Y, False, True)
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False
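
Why this works: the element-wise set difference X - Y is empty exactly when X is a subset of Y, and np.where treats an empty set as falsy. A quick look at the intermediate result (a sketch, reusing the DF from the question):

import numpy as np
import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

df_temp = DF.applymap(set)
diff = df_temp.X - df_temp.Y   # element-wise set difference
# row 0: empty set, so X is a subset of Y
# row 1: {2}, since 2 is in X but not in Y, so X is not a subset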

Option 2
Faster, astype conversion

DF['x_sub_y'] = ~(DF.X.apply(set) - DF.Y.apply(set)).astype(bool)
DF 
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Option 3
Fun with np.vectorize

def foo(x):
    return not x

v = np.vectorize(foo)
DF['x_sub_y'] = v(DF.X.apply(set) - DF.Y.apply(set))
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Extending Scott Boston's answer for speed using the same approach:

def foo(x, y):
    return set(x).issubset(y)

v = np.vectorize(foo)

DF['x_sub_y'] = v(DF.X, DF.Y)
DF
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

Small (the two-row DF above)

1000 loops, best of 3: 460 µs per loop           # Before       
10000 loops, best of 3: 103 µs per loop          # After

Large (df * 10000)

1 loop, best of 3: 1.26 s per loop               # Before   
100 loops, best of 3: 13.3 ms per loop           # After
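
A sketch of one way to reproduce these timings, assuming the "Large" frame is simply DF repeated 10,000 times and "Before"/"After" refer to the axis=1 apply versus the vectorized call:

import numpy as np
import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})
big = pd.concat([DF] * 10000, ignore_index=True)   # assumed "Large" frame

v = np.vectorize(lambda x, y: set(x).issubset(y))

# In IPython / Jupyter:
# %timeit big.apply(lambda row: set(row.X).issubset(set(row.Y)), axis=1)   # Before
# %timeit v(big.X, big.Y)                                                  # After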

4 Comments

does it make sense to "timeit" against a two-row DF?
@MaxU If the difference is so vast for a couple of rows, then there may be no need to time for a larger one... but I can do it :)
a timing for a larger data set might (not necessarily) show quite different results...
@MaxU Thanks for asking... appreciating np.vectorise even more now :-) (why haven't I used it before? x))

Use set and issubset:

DF.assign(x_sub_y = DF.apply(lambda x: set(x.X).issubset(set(x.Y)), axis=1))

Output:

        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False

2 Comments

If I vectorise your answer, I get a 4x speedup on this tiny dataset.
Btw, you don't need to convert x.Y to a set; issubset accepts any iterable (see the sketch below).
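
To illustrate the comment above, a minimal sketch of the same one-liner without the inner set() conversion:

import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

# x.Y can stay a list; set.issubset accepts any iterable
DF.assign(x_sub_y=DF.apply(lambda x: set(x.X).issubset(x.Y), axis=1))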

Or you can try set

DF['x_sub_y'] = DF.X + DF.Y
DF['x_sub_y'] = DF['x_sub_y'].apply(lambda x: list(set(x))) == DF.Y
DF
Out[691]: 
        X          Y  x_sub_y
0  [1, 5]  [1, 2, 5]     True
1  [1, 2]  [1, 3, 5]    False
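
The idea here: if X is a subset of Y, concatenating X and Y and deduplicating gives back exactly the elements of Y. Note that the list comparison above relies on list(set(...)) coming out in the same order as Y, which happens to hold for this data; comparing sets instead avoids that assumption. A sketch of the order-independent variant:

import pandas as pd

DF = pd.DataFrame({'X': [[1, 5], [1, 2]], 'Y': [[1, 2, 5], [1, 3, 5]]})

# Compare the union of X and Y with Y, as sets rather than lists
DF['x_sub_y'] = (DF.X + DF.Y).apply(set) == DF.Y.apply(set)
DF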

