Given the pandas DataFrame df below, how can I find the rows that contain both of the values 6 and 10?

    0   1   2   3   4   5   6
0   11  1   3   4   6   8   10
1   11  1   3   4   6   8   11
2   11  1   3   4   6   8   0
3   11  1   3   4   6   9   10
4   11  1   3   4   6   9   11
5   11  1   3   4   6   9   0
6   11  1   3   4   6   10  10
7   11  1   3   4   6   10  11
8   11  1   3   4   6   10  0
9   11  1   3   4   7   8   10

I can obtain these lines with a solution based on sets:

>>> df.iloc[[i for i, s in enumerate(df.itertuples(index=False)) if {6, 10} <= set(s)]]

    0   1   2   3   4   5   6
0   11  1   3   4   6   8   10
3   11  1   3   4   6   9   10
6   11  1   3   4   6   10  10
7   11  1   3   4   6   10  11
8   11  1   3   4   6   10  0

My question is: Is there a better way in Pandas to get True in the lines where these given values are present? Something such as:

df.where({6, 10} <= df)

The data example:

import pandas

df = pandas.DataFrame.from_dict({0: {0: 11, 1: 11, 2: 11, 3: 11, 4: 11, 5: 11, 6: 11, 7: 11, 8: 11, 9: 11},
 1: {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1},
 2: {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
 3: {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4, 9: 4},
 4: {0: 6, 1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 7: 6, 8: 6, 9: 7},
 5: {0: 8, 1: 8, 2: 8, 3: 9, 4: 9, 5: 9, 6: 10, 7: 10, 8: 10, 9: 8},
 6: {0: 10, 1: 11, 2: 0, 3: 10, 4: 11, 5: 0, 6: 10, 7: 11, 8: 0, 9: 10}})

Edit

This dataframe is only a short piece of my real data. An integer between 0 and 11 can appear from 0 to 2 times in each line. For instance, in these lines, the values 4, 8 and 11 appear two times each.

        0   1   2   3   4   5   6
100     11  1   4   4   8   8   11
343     11  2   4   4   8   8   11
505     11  3   3   4   8   8   11
586     11  3   4   4   8   8   11
1558    1   1   4   4   8   8   11
4 Comments
  • Will your data contain duplicate 6s or 10s? Commented Sep 23, 2015 at 13:05
  • Yes, the data can contain duplicate values. Commented Sep 23, 2015 at 13:07
  • But would you ever have a row with only duplicate 6s/10s? Commented Sep 23, 2015 at 13:08
  • I can't have a row with only duplicate 6s and 10s. Each value can have only one duplication. I edited my post with an example. Commented Sep 23, 2015 at 13:46

1 Answer

You can use isin to test for membership, then call dropna with thresh=2 to keep only the rows where at least 2 non-NaN values remain. Note that this counts matching cells, so a duplicated 6 or 10 also counts toward the threshold:

In [20]:
df[df.isin([6,10])].dropna(thresh=2)

Out[20]:
    0   1   2   3  4   5   6
0 NaN NaN NaN NaN  6 NaN  10
3 NaN NaN NaN NaN  6 NaN  10
6 NaN NaN NaN NaN  6  10  10
7 NaN NaN NaN NaN  6  10 NaN
8 NaN NaN NaN NaN  6  10 NaN
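
One caveat with this approach, sketched below on the question's data: thresh counts matching cells rather than distinct values, so a repeated value can stand in for a missing one. Searching for 5, 6 and 10 with thresh=3 still keeps row 6, which holds 6, 10, 10 but no 5. The name kept is only illustrative.

```python
import pandas as pd

# Same data as in the question, written column by column.
df = pd.DataFrame({
    0: [11] * 10,
    1: [1] * 10,
    2: [3] * 10,
    3: [4] * 10,
    4: [6] * 9 + [7],
    5: [8, 8, 8, 9, 9, 9, 10, 10, 10, 8],
    6: [10, 11, 0, 10, 11, 0, 10, 11, 0, 10],
})

# Mask everything except 5, 6 and 10, then require 3 surviving cells per row.
kept = df[df.isin([5, 6, 10])].dropna(thresh=3)

# Row 6 survives because its two 10s are counted separately,
# even though the value 5 never occurs in it.
print(kept.index.tolist())  # → [6]
```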

I think actually it's better to test for each value and apply any:

In [41]:
df.apply(lambda x: (x == 6).any() & (x == 10).any(), axis=1)

Out[41]:
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7     True
8     True
9    False
dtype: bool
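
The same test can probably be written without a row-wise apply, by comparing the whole frame at once and reducing along the columns. This is a sketch on the question's data, not a benchmarked claim; mask is just an illustrative name.

```python
import pandas as pd

# Same data as in the question, written column by column.
df = pd.DataFrame({
    0: [11] * 10,
    1: [1] * 10,
    2: [3] * 10,
    3: [4] * 10,
    4: [6] * 9 + [7],
    5: [8, 8, 8, 9, 9, 9, 10, 10, 10, 8],
    6: [10, 11, 0, 10, 11, 0, 10, 11, 0, 10],
})

# df.eq(v) is an element-wise comparison; any(axis=1) asks whether
# the value occurs anywhere in each row.
mask = df.eq(6).any(axis=1) & df.eq(10).any(axis=1)

# Boolean indexing keeps the original values of the matching rows.
print(df[mask].index.tolist())  # → [0, 3, 6, 7, 8]
```

Because each value is tested separately, duplicates within a row do not affect the result.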

For 3 values you can do:

df.apply(lambda x: (x==5).any() & (x == 6).any() & (x == 10).any(), axis=1)
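
For an arbitrary number of values, the chained conditions could be built in a loop instead of being typed out. The helper below is hypothetical (rows_containing_all is not a pandas function) and is only a sketch of that idea.

```python
from functools import reduce
import operator

import pandas as pd


def rows_containing_all(frame, values):
    """Return a boolean Series that is True where a row holds every value.

    Hypothetical helper, not part of pandas.
    """
    # One boolean Series per value: does the value occur anywhere in the row?
    masks = (frame.eq(v).any(axis=1) for v in values)
    # Start from all-True so an empty `values` matches every row.
    start = pd.Series(True, index=frame.index)
    return reduce(operator.and_, masks, start)


# Same data as in the question, written column by column.
df = pd.DataFrame({
    0: [11] * 10,
    1: [1] * 10,
    2: [3] * 10,
    3: [4] * 10,
    4: [6] * 9 + [7],
    5: [8, 8, 8, 9, 9, 9, 10, 10, 10, 8],
    6: [10, 11, 0, 10, 11, 0, 10, 11, 0, 10],
})

two = rows_containing_all(df, [6, 10])
three = rows_containing_all(df, [5, 6, 10])

print(df[two].index.tolist())    # → [0, 3, 6, 7, 8]
print(df[three].index.tolist())  # → [] (no row contains a 5)
```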

4 Comments

How could I adapt the code to find a set not in the dataframe? For instance, [5, 6, 10]. I tried df[df.isin([5, 6,10])].dropna(thresh=3) and got the line 6.
You mean values not in 5,6,10?
I want to find out only the rows where the three values (5, 6 and 10) are present.
I think that you can do df.apply(lambda x: (x==5).any() & (x == 6).any() & (x == 10).any(), axis=1)
