Selecting Pandas Dataframe by set values

Question

Given the Pandas DataFrame df below, how can I find the lines that contain both values 6 and 10?

    0   1   2   3   4   5   6
0   11  1   3   4   6   8   10
1   11  1   3   4   6   8   11
2   11  1   3   4   6   8   0
3   11  1   3   4   6   9   10
4   11  1   3   4   6   9   11
5   11  1   3   4   6   9   0
6   11  1   3   4   6   10  10
7   11  1   3   4   6   10  11
8   11  1   3   4   6   10  0
9   11  1   3   4   7   8   10

I can obtain these lines with a solution based on sets:

>>> df.iloc[[i for i, s in enumerate(df.itertuples(index=False)) if {6, 10} <= set(s)]]

    0   1   2   3   4   5   6
0   11  1   3   4   6   8   10
3   11  1   3   4   6   9   10
6   11  1   3   4   6   10  10
7   11  1   3   4   6   10  11
8   11  1   3   4   6   10  0

My question is: Is there a better way in Pandas to get True in the lines where these given values are present? Something such as:

df.where({6, 10} <= df)

The data example:

import pandas

df = pandas.DataFrame.from_dict({0: {0: 11, 1: 11, 2: 11, 3: 11, 4: 11, 5: 11, 6: 11, 7: 11, 8: 11, 9: 11},
 1: {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1},
 2: {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
 3: {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4, 9: 4},
 4: {0: 6, 1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 7: 6, 8: 6, 9: 7},
 5: {0: 8, 1: 8, 2: 8, 3: 9, 4: 9, 5: 9, 6: 10, 7: 10, 8: 10, 9: 8},
 6: {0: 10, 1: 11, 2: 0, 3: 10, 4: 11, 5: 0, 6: 10, 7: 11, 8: 0, 9: 10}})

Edit

This dataframe is only a short piece of my real data. An integer between 0 and 11 can appear from 0 to 2 times in each line. For instance, in these lines, the values 4, 8 and 11 appear two times each.

        0   1   2   3   4   5   6
100     11  1   4   4   8   8   11
343     11  2   4   4   8   8   11
505     11  3   3   4   8   8   11
586     11  3   4   4   8   8   11
1558    1   1   4   4   8   8   11

I can't have a row with only duplicate 6s and 10s. Each value can have only one duplication. I edited my post with an example.
– msampaio, Commented Sep 23, 2015 at 13:46

EdChum · Accepted Answer · 2015-09-23 13:52:23Z

You can use isin to test for membership and then call dropna and pass thresh=2 to show only the rows where at least 2 non-NaN values exist:

In [20]:
df[df.isin([6,10])].dropna(thresh=2)

Out[20]:
    0   1   2   3  4   5   6
0 NaN NaN NaN NaN  6 NaN  10
3 NaN NaN NaN NaN  6 NaN  10
6 NaN NaN NaN NaN  6  10  10
7 NaN NaN NaN NaN  6  10 NaN
8 NaN NaN NaN NaN  6  10 NaN
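If the full rows are wanted instead of the NaN-masked view, one possible variant is to count the matching cells per row and use that count as a boolean mask. The sketch below rebuilds the question's data; the names `matches` and `false_hit` are introduced here for illustration only. Like `dropna(thresh=...)`, this counts cells rather than distinct values, which is why the three-value attempt with [5, 6, 10] picks up line 6.

```python
import pandas as pd

# The ten rows from the question (default integer index and column labels).
df = pd.DataFrame([
    [11, 1, 3, 4, 6, 8, 10],
    [11, 1, 3, 4, 6, 8, 11],
    [11, 1, 3, 4, 6, 8, 0],
    [11, 1, 3, 4, 6, 9, 10],
    [11, 1, 3, 4, 6, 9, 11],
    [11, 1, 3, 4, 6, 9, 0],
    [11, 1, 3, 4, 6, 10, 10],
    [11, 1, 3, 4, 6, 10, 11],
    [11, 1, 3, 4, 6, 10, 0],
    [11, 1, 3, 4, 7, 8, 10],
])

# Number of cells per row that equal 6 or 10.
matches = df.isin([6, 10]).sum(axis=1)

# Keep the full rows with at least two matching cells.
print(df[matches >= 2].index.tolist())  # [0, 3, 6, 7, 8]

# Caveat: cells are counted, not distinct values. Row 6 holds 6, 10, 10,
# so it reaches a count of three for [5, 6, 10] although it has no 5.
false_hit = df[df.isin([5, 6, 10]).sum(axis=1) >= 3]
print(false_hit.index.tolist())  # [6]
```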

I think it's actually better to test for each value separately and combine the results with any:

In [41]:
df.apply(lambda x: (x == 6).any() & (x == 10).any(), axis=1)

Out[41]:
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7     True
8     True
9    False
dtype: bool
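The same mask can probably be built without a row-wise apply by comparing the whole frame to each value and reducing across columns. This is a sketch on the question's data, not a benchmarked claim; on large frames the column-wise form is likely faster, but that is an assumption worth measuring. The names `has_6`, `has_10` and `mask` are illustrative.

```python
import pandas as pd

# The ten rows from the question (default integer index and column labels).
df = pd.DataFrame([
    [11, 1, 3, 4, 6, 8, 10],
    [11, 1, 3, 4, 6, 8, 11],
    [11, 1, 3, 4, 6, 8, 0],
    [11, 1, 3, 4, 6, 9, 10],
    [11, 1, 3, 4, 6, 9, 11],
    [11, 1, 3, 4, 6, 9, 0],
    [11, 1, 3, 4, 6, 10, 10],
    [11, 1, 3, 4, 6, 10, 11],
    [11, 1, 3, 4, 6, 10, 0],
    [11, 1, 3, 4, 7, 8, 10],
])

# One boolean Series per value: does the row contain it in any column?
has_6 = df.eq(6).any(axis=1)
has_10 = df.eq(10).any(axis=1)

# A row qualifies only if both values are present.
mask = has_6 & has_10
print(df[mask].index.tolist())  # [0, 3, 6, 7, 8]
```

Indexing with `df[mask]` returns the matching rows unchanged, so no NaN values are introduced.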

For three values you can do:

df.apply(lambda x: (x==5).any() & (x == 6).any() & (x == 10).any(), axis=1)
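For an arbitrary number of values, chaining `&` by hand gets unwieldy. A minimal sketch of one way to generalize it follows; `rows_containing_all` is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd


def rows_containing_all(frame, values):
    """Return a boolean Series: True where a row contains every value."""
    # One boolean Series per value, True where the row holds that value.
    per_value = [frame.eq(v).any(axis=1) for v in values]
    # Combine the per-value masks with a logical AND.
    return pd.Series(np.logical_and.reduce(per_value), index=frame.index)


# The ten rows from the question (default integer index and column labels).
df = pd.DataFrame([
    [11, 1, 3, 4, 6, 8, 10],
    [11, 1, 3, 4, 6, 8, 11],
    [11, 1, 3, 4, 6, 8, 0],
    [11, 1, 3, 4, 6, 9, 10],
    [11, 1, 3, 4, 6, 9, 11],
    [11, 1, 3, 4, 6, 9, 0],
    [11, 1, 3, 4, 6, 10, 10],
    [11, 1, 3, 4, 6, 10, 11],
    [11, 1, 3, 4, 6, 10, 0],
    [11, 1, 3, 4, 7, 8, 10],
])

print(df[rows_containing_all(df, [6, 10])].index.tolist())     # [0, 3, 6, 7, 8]
print(df[rows_containing_all(df, [5, 6, 10])].index.tolist())  # []
```

Because each value is tested on its own, duplicated values in a row do not inflate the result: the sample data contains no 5, so no row matches [5, 6, 10].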

edited Sep 23, 2015 at 13:52

answered Sep 23, 2015 at 12:36


4 Comments

msampaio Over a year ago

How could I adapt the code to test for a set that is not in the dataframe, for instance [5, 6, 10]? I tried df[df.isin([5, 6, 10])].dropna(thresh=3) and got line 6.

EdChum Over a year ago

You mean values not in 5,6,10?

msampaio Over a year ago

I want to find out only the rows where the three values (5, 6 and 10) are present.

EdChum Over a year ago

I think that you can do df.apply(lambda x: (x==5).any() & (x == 6).any() & (x == 10).any(), axis=1)
