Evaluate Pandas DataFrame against two string values

Question

I have a Pandas DataFrame "table" that contains a column called "OPINION", filled with string values. I would like to create a new column called "cond5" that is filled with TRUE for every row where "OPINION" is either "buy" or "neutral".

I have tried

table["cond5"]= table.OPINION == "buy" or table.OPINION == "neutral"

which gives me an error, and

table["cond5"]= table.OPINION.all() in ("buy", "neutral")

which returns FALSE for all rows.

unutbu · Accepted Answer · 2015-02-24 14:26:26Z

As Ed Chum points out, you could use the isin method:

table['cond5'] = table['OPINION'].isin(['buy', 'neutral'])

isin checks each value in the column for exact equality against the listed values. It is probably the easiest and most readable option.
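To make "exact equality" concrete, here is a small sketch. The sample values (including the differently cased and space-padded entries) are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
table = pd.DataFrame({"OPINION": ["buy", "Buy", "neutral", "sell", "buy "]})

# isin compares whole values, so "Buy" and "buy " (trailing space) do not match.
table["cond5"] = table["OPINION"].isin(["buy", "neutral"])
print(table)
```

If the real data has inconsistent case or stray whitespace, it would need normalizing first (for example with str.strip() and str.lower()) before isin can match it.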

To fix

table["cond5"] = table.OPINION == "buy" or table.OPINION == "neutral"

use

table["cond5"] = (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")

The parentheses are necessary because | has higher precedence (binding power) than ==.
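The effect of precedence is easiest to see with plain integers instead of a Series. This is only an illustrative sketch; the variable names are invented here:

```python
# Without parentheses, | binds first and the rest becomes a chained comparison:
#   1 == 1 | 2 == 2   is parsed as   1 == (1 | 2) == 2,   i.e.   1 == 3 == 2
unparenthesized = 1 == 1 | 2 == 2

# With parentheses, each comparison is evaluated first and the results are or-ed.
parenthesized = (1 == 1) | (2 == 2)

print(unparenthesized, parenthesized)
```

The same parsing applies to the Series expression, which is why each comparison needs its own parentheses.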

x or y evaluates x in a boolean context, so x must be reducible to a single True or False.

(table['OPINION'] == "buy") or (table['OPINION'] == "neutral")

raises a ValueError, since a Series cannot be reduced to a single boolean value.

So instead use the | operator (Python's bitwise or), which Pandas applies element-wise, taking the logical or of the corresponding values in the two Series.
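A minimal sketch of the difference, using a tiny made-up table (the three sample values are an assumption for illustration):

```python
import pandas as pd

table = pd.DataFrame({"OPINION": ["buy", "neutral", "sell"]})  # hypothetical data
is_buy = table["OPINION"] == "buy"
is_neutral = table["OPINION"] == "neutral"

# `or` asks for the truth value of the whole Series, which Pandas refuses to guess.
try:
    is_buy or is_neutral
    raised = False
except ValueError as err:
    raised = True
    print("or failed:", err)

# `|` combines the two boolean Series row by row.
cond5 = is_buy | is_neutral
print(cond5.tolist())
```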

Another alternative is

import numpy as np
table["cond5"] = np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])

which might be useful if ('buy', 'neutral') were a longer tuple.
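For instance, with a longer tuple the same pattern might look like the sketch below. The extra values ("hold", "strong buy") and the name wanted are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data and a hypothetical, longer set of accepted values.
table = pd.DataFrame({"OPINION": ["buy", "neutral", "sell", "hold", "strong buy"]})
wanted = ("buy", "neutral", "hold")

# One boolean mask per accepted value, then or them all together element-wise.
masks = [table["OPINION"] == val for val in wanted]
table["cond5"] = np.logical_or.reduce(masks)
print(table)
```

For plain equality checks like this one, isin(wanted) gives the same result; the reduce pattern is more useful when the individual masks are not all simple equality tests.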

Yet another option is to use Pandas' vectorized string method, str.contains:

table["cond5"] = table['OPINION'].str.contains(r'buy|neutral')

str.contains performs a regex search for the pattern r'buy|neutral' in a Cythonized loop for each item in table['OPINION']. Note that this is a substring search, so unlike isin it would also match a value such as "buyer".
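Because str.contains searches within each string, it can match more than isin does. The sketch below uses made-up values to show this, and an anchored pattern as one possible way to require a whole-string match:

```python
import pandas as pd

# Hypothetical values; "buyer" is included to show the substring behaviour.
opinions = pd.Series(["buy", "buyer", "neutral", "sell"])

# Unanchored: matches anywhere in the string, so "buyer" is flagged too.
substring = opinions.str.contains(r"buy|neutral")

# Anchored with a non-capturing group: only whole-string matches count.
anchored = opinions.str.contains(r"^(?:buy|neutral)$")

print(substring.tolist())
print(anchored.tolist())
```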

Which one should you use? Here is a %timeit benchmark run in IPython:

In [10]: table = pd.DataFrame({'OPINION':np.random.choice(['buy','neutral','sell',''], size=10**6)})

In [11]: %timeit (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")
10 loops, best of 3: 121 ms per loop

In [12]: %timeit np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])
1 loops, best of 3: 204 ms per loop

In [13]: %timeit table['OPINION'].str.contains(r'buy|neutral')
1 loops, best of 3: 474 ms per loop

In [14]: %timeit table['OPINION'].isin(['buy', 'neutral'])
10 loops, best of 3: 40 ms per loop

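So it looks like isin is fastest: in this benchmark it is about three times faster than the | version and more than ten times faster than str.contains.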
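To rerun the comparison outside IPython, something like the following sketch with the standard timeit module should work. The smaller table size, the fixed seed, and the names candidates and timings are choices made for this example, and the absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
table = pd.DataFrame(
    {"OPINION": rng.choice(["buy", "neutral", "sell", ""], size=10**5)}
)

candidates = {
    "| operator": lambda: (table["OPINION"] == "buy") | (table["OPINION"] == "neutral"),
    "logical_or.reduce": lambda: np.logical_or.reduce(
        [table["OPINION"] == val for val in ("buy", "neutral")]
    ),
    "str.contains": lambda: table["OPINION"].str.contains(r"buy|neutral"),
    "isin": lambda: table["OPINION"].isin(["buy", "neutral"]),
}

reference = np.asarray(candidates["isin"]())
timings = {}
for name, func in candidates.items():
    # Check that every approach agrees on this data before timing it.
    assert np.array_equal(np.asarray(func()), reference)
    # Best of 3 repeats, 3 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name:>18}: {timings[name] * 1e3:.2f} ms per call")
```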

Another method is table['OPINION'].isin(['buy', 'neutral'])
