10

I have dataframe that has values like those:

  A                    B
["I need avocado"   "something"]
["something"      "I eat margarina"]

And I want to find rows that:

In any column of the row, the column's value is contained in a list. for example, for the list:

["apple","avocado","bannana"]

And only this line should match: ["I need avocado" "something"]

This line doesnt work:

dataFiltered[dataFiltered[col].str.contains(*includeKeywords)]

Returns:

{TypeError}unsupported operand type(s) for &: 'str' and 'int'

What should I do?

1
  • Try .str.contains('|'.join(includeKeywords) Commented May 1, 2019 at 18:51

2 Answers 2

21

Setup

df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]

Problem

                A                B
0  I need avocado        something  # True 'avocado' in A
1       something  I eat margarina
2         useless        eat apple  # True 'apple' in B
3         nothing     more nothing

Solution


df[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]

                A          B
0  I need avocado  something
2         useless  eat apple

Details

This produces a regex search string. In regex, '|' means or. So for a regex search, this says match 'apple', 'avocado', or 'bannana'

kwstr = '|'.join(includeKeywords)
print(kwstr)

apple|avocado|bannana

Stacking will flatten our DataFrame

df.stack()

0  A     I need avocado
   B          something
1  A          something
   B    I eat margarina
2  A            useless
   B          eat apple
3  A            nothing
   B       more nothing
dtype: object

Fortunately, the pandas.Series.str.contains method can handle regex and it will produce a boolean Series

df.stack().str.contains(kwstr)

0  A     True
   B    False
1  A    False
   B    False
2  A    False
   B     True
3  A    False
   B    False
dtype: bool

At which point we can cleverly use pandas.Series.any by suggesting it only care about level=0

mask = df.stack().str.contains(kwstr).any(level=0)
mask

0     True
1    False
2     True
3    False
dtype: bool

By using level=0 we preserved the original index in the resulting Series. This makes it perfect for filtering df

df[mask]

                A          B
0  I need avocado  something
2         useless  eat apple
Sign up to request clarification or add additional context in comments.

2 Comments

Wow! short and perfect
when only looking at a single column, a one liner using regex alone is enough: df[df.colname.str.contains('|'.join(includeKeywords))]. Very simple, thanks
4

Take advantage of the any() function and use a list comprenesion in an df.apply()

df = pd.DataFrame(["I need avocado","I eat margarina"])
print(df)
#                  0
# 0   I need avocado
# 1  I eat margarina

includeKeywords = ["apple","avocado","bannana"]

print(df[df.apply(lambda r: any([kw in r[0] for kw in includeKeywords]), axis=1)])
#                 0
# 0  I need avocado

To make this a bit clearer, you basically need to make a mask that returns True/False for each row

mask = [any([kw in r for kw in includeKeywords]) for r in df[0]]
print(mask)

Then you can use that mask to print the selected rows in your DataFrame

# [True, False]
print(df[mask])
#                 0
# 0  I need avocado

I am showing you both ways because while the df.apply() method is handy for a one liner, it is really slow compared to a standard list comprehension. So if you have a small enough set, feel free to use df.apply(). Otherwise, I'd suggest the python comprehension over the pandas method.

3 Comments

Thank you. Your first solution didnt work for me. The second one only works for one column. I want to get all rows that contains the values of the list in any column
I can "join" masks to one mask, but is there a way with pandas?
Well, there are several ways you could do it. You could put another comprehension layer, or make a series of masks in a list of lists than zip that up and do [any(i1, i2, i3,...) for i1, i2, i3 in zip(m1, m2, m3)]

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.