2

I need to check a column for string containment and set a new column to the matching substring. I am currently trying this:

df['NEW_COL'] = df['COL_TO_CHECK'].str.contains('|'.join(substring_list))

Instead of returning the boolean True/False for containment, I need to return the actual value from substring_list that matches, and use it to populate df['NEW_COL'].

SUBSTRINGS TO CHECK FOR

substring_list = ['apple', 'banana', 'cherry']

RESULTING DATAFRAME

OLD_COL              NEW_COL
apple pie            apple
black cherry         cherry
banana lemon drop    banana
  • Please share an example of input and output... Commented May 5, 2017 at 10:44

2 Answers

3

You have not said much about what your data looks like or what exactly you want, but the general principle is that you can use:

df['NEW_COL'] = df['COL_TO_CHECK'].apply(lambda x: do_something(x) if is_something(x) else x)

Or in your example:

substring_list = set(['apple', 'banana', 'cherry'])
df['NEW_COL'] = df['OLD_COL'].apply(lambda x: set(x.split()).intersection(substring_list).pop())

set is faster :)
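The one-liner above raises a KeyError from .pop() when a row contains none of the words, and picks an arbitrary element when a row contains several. Here is a minimal sketch of a variant that keeps the set-membership idea but returns the first matching word, or None when nothing matches. The helper name first_match and the extra row 'plain yogurt' are made up for illustration. Like the original, it matches whole whitespace-separated words, not arbitrary substrings.

```python
import pandas as pd

# Set membership tests are O(1) on average
substring_set = {'apple', 'banana', 'cherry'}

def first_match(text):
    # Hypothetical helper: return the first word of `text` found in the set,
    # or None if no word matches
    for word in text.split():
        if word in substring_set:
            return word
    return None

df = pd.DataFrame({
    'OLD_COL': ['apple pie', 'black cherry', 'banana lemon drop', 'plain yogurt']
})
df['NEW_COL'] = df['OLD_COL'].apply(first_match)
print(df)
```

Rows without a match end up as missing values in NEW_COL, so they can be filtered with df['NEW_COL'].isna().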


4 Comments

This solution works, but I don't believe it is the best way to do it; creating a list for every line doesn't seem efficient. df['COL_TO_CHECK'].apply(lambda x: [s for s in substring_list if s in x][0])
set is the best for containment checks, I knew that, duh :)
I think that my solution is more flexible
@MaxU No problem :) +1 back at you
2

I'd do it this way:

In [148]: df
Out[148]:
             OLD_COL
0          apple pie
1       black cherry
2  banana lemon drop

In [149]: pat = '.*({}).*'.format('|'.join(substring_list))

In [150]: pat
Out[150]: '.*(apple|banana|cherry).*'

In [151]: df['NEW_COL'] = df['OLD_COL'].str.replace(pat, r'\1', regex=True)

In [152]: df
Out[152]:
             OLD_COL NEW_COL
0          apple pie   apple
1       black cherry  cherry
2  banana lemon drop  banana
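As a possible alternative to replacing the whole string with the captured group, str.extract can pull the group out directly. This is a sketch rather than the method used in this answer. Using re.escape is an added precaution in case the substrings ever contain regex metacharacters.

```python
import re
import pandas as pd

substring_list = ['apple', 'banana', 'cherry']
df = pd.DataFrame({'OLD_COL': ['apple pie', 'black cherry', 'banana lemon drop']})

# One capture group containing all alternatives, each escaped for safety
pat = '({})'.format('|'.join(map(re.escape, substring_list)))

# expand=False returns a Series; rows with no match become NaN
df['NEW_COL'] = df['OLD_COL'].str.extract(pat, expand=False)
print(df)
```

Two differences from the str.replace version are worth noting. Rows with no match become NaN here, whereas str.replace leaves the original string unchanged. When a row contains several of the substrings, str.extract returns the leftmost one, while the pattern with a leading greedy .* captures the rightmost.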

8 Comments

what is the r'\1'?
r'\1' is the first captured RegEx group
What is your opinion of the method I used above? df['COL_TO_CHECK'].apply(lambda x: [s for s in substring_list if s in x][0])
@AranFreel, oh, I didn't notice that solution in the comment. It looks quite good to me...
@AranFreel, there are better options if you are searching for words, not for substrings