I have a Pandas DataFrame "table" that contains a column called "OPINION", filled with string values. I would like to create a new column called "cond5" that is filled with TRUE for every row where "OPINION" is either "buy" or "neutral".

I have tried

table["cond5"]= table.OPINION == "buy" or table.OPINION == "neutral"

which gives me an error, and

table["cond5"]= table.OPINION.all() in ("buy", "neutral")

which returns FALSE for all rows.

1 Answer

As Ed Chum points out, you could use the isin method:

table['cond5'] = table['OPINION'].isin(['buy', 'neutral'])

isin checks each value for exact equality with the listed values, so it is probably the easiest and most readable option.
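As a minimal sketch of what "exact equality" means here, the following uses a small made-up table (only the column name OPINION comes from the question; the sample values are assumptions for illustration). Differently cased strings and missing values do not match:

```python
import pandas as pd

# Hypothetical sample data; only the column name OPINION is from the question.
table = pd.DataFrame({"OPINION": ["buy", "sell", "neutral", "Buy", None]})

# True only where the value is exactly "buy" or exactly "neutral".
table["cond5"] = table["OPINION"].isin(["buy", "neutral"])

print(table)
```

If case-insensitive matching were wanted, one option would be to normalise first, for example with table["OPINION"].str.lower().isin([...]).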


To fix

table["cond5"] = table.OPINION == "buy" or table.OPINION == "neutral"

use

table["cond5"] = (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")

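The parentheses are necessary because | has higher precedence (binding power) than ==. Without them, Python would group the expression as table['OPINION'] == ("buy" | table['OPINION']) == "neutral", which is not the intended comparison.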

x or y evaluates x in a boolean context, so x must be reducible to a single True or False.

(table['OPINION'] == "buy") or (table['OPINION'] == "neutral")

raises a ValueError since a Series cannot be reduced to a single boolean value.

So instead use the | operator (bitwise or), which pandas applies element-wise to the values in the two Series.
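The difference between or and | can be seen in a short sketch. The sample values below are assumptions for illustration, not data from the question:

```python
import pandas as pd

opinion = pd.Series(["buy", "sell", "neutral"])  # hypothetical sample values

is_buy = opinion == "buy"
is_neutral = opinion == "neutral"

# `or` calls bool() on its left operand, which pandas refuses to do
# for a Series with more than one element.
try:
    is_buy or is_neutral
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# `|` is overloaded by pandas to combine the two boolean Series row by row.
combined = is_buy | is_neutral
print(combined.tolist())
```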


Another alternative is

import numpy as np
table["cond5"] = np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])

which might be useful if ('buy', 'neutral') were a longer tuple.
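A sketch of that longer-tuple case follows, assuming a hypothetical three-value tuple named wanted and made-up sample data. np.logical_or.reduce ORs all of the per-value masks together in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and target values, for illustration only.
table = pd.DataFrame({"OPINION": ["buy", "sell", "neutral", "hold", "strong buy"]})
wanted = ("buy", "neutral", "hold")

# One boolean mask per wanted value.
masks = [table["OPINION"] == val for val in wanted]

# Reduce the stacked masks with element-wise logical or.
table["cond5"] = np.logical_or.reduce(masks)

print(table)
```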


Yet another option is to use Pandas' vectorized string method, str.contains:

table["cond5"] = table['OPINION'].str.contains(r'buy|neutral')

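str.contains performs a regex search for the pattern r'buy|neutral' in a Cythonized loop for each item in table['OPINION']. Because this is a search, it also matches strings that merely contain "buy" or "neutral", so it is not strictly equivalent to isin.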
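The sketch below compares the unanchored pattern with an anchored variant on made-up values, including a hypothetical "buyback" entry that is not in the question's data. The anchored pattern is one possible way to get exact matches from str.contains, not something the original answer proposes:

```python
import pandas as pd

# Hypothetical values; "buyback" is included only to show the difference.
opinion = pd.Series(["buy", "neutral", "sell", "buyback"])

# Unanchored search: matches anywhere inside the string.
substring_match = opinion.str.contains(r"buy|neutral")

# Anchored with ^...$ and a non-capturing group: whole-string matches only.
anchored_match = opinion.str.contains(r"^(?:buy|neutral)$")

# isin for comparison.
isin_match = opinion.isin(["buy", "neutral"])

print(pd.DataFrame({
    "OPINION": opinion,
    "substring": substring_match,
    "anchored": anchored_match,
    "isin": isin_match,
}))
```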


Now how to decide which one to use? Here is a timeit benchmark using IPython:

In [10]: table = pd.DataFrame({'OPINION':np.random.choice(['buy','neutral','sell',''], size=10**6)})

In [11]: %timeit (table['OPINION'] == "buy") | (table['OPINION'] == "neutral")
10 loops, best of 3: 121 ms per loop

In [12]: %timeit np.logical_or.reduce([(table['OPINION'] == val) for val in ('buy', 'neutral')])
1 loops, best of 3: 204 ms per loop

In [13]: %timeit table['OPINION'].str.contains(r'buy|neutral')
1 loops, best of 3: 474 ms per loop

In [14]: %timeit table['OPINION'].isin(['buy', 'neutral'])
10 loops, best of 3: 40 ms per loop

So it looks like isin is fastest: about 40 ms on one million rows, roughly three times faster than the | version (121 ms).
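To repeat the comparison outside IPython, a rough sketch using the standard timeit module might look like the following. The smaller table size, the seed, and the names candidates and agreement are assumptions for illustration. Timings depend on the machine and the pandas version, so none are shown. The script also checks that all four approaches agree on this particular data, which holds here because every value is one of the four exact strings:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
table = pd.DataFrame(
    {"OPINION": rng.choice(["buy", "neutral", "sell", ""], size=10**5)}
)

# The four approaches from the answer, wrapped so they can be timed.
candidates = {
    "== | ==": lambda: (table["OPINION"] == "buy") | (table["OPINION"] == "neutral"),
    "logical_or.reduce": lambda: np.logical_or.reduce(
        [table["OPINION"] == val for val in ("buy", "neutral")]
    ),
    "str.contains": lambda: table["OPINION"].str.contains(r"buy|neutral"),
    "isin": lambda: table["OPINION"].isin(["buy", "neutral"]),
}

reference = np.asarray(candidates["isin"]())
agreement = {}
for name, func in candidates.items():
    # Check the result against isin before timing it.
    agreement[name] = bool(np.array_equal(np.asarray(func()), reference))
    # Best of 3 repeats, 3 calls each, reported per call.
    seconds = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name:>18}: {seconds * 1000:.1f} ms, same result as isin: {agreement[name]}")
```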

1 Comment

Another method is table['OPINION'].isin(['buy', 'neutral'])
