Python Pandas - check for substring containment and set new column to substring

Question

I need to check whether a column contains any of several substrings and set a new column to the substring that matched. I am currently trying this:

df['NEW_COL'] = df['COL_TO_CHECK'].str.contains('|'.join(substring_list))

Instead of returning the boolean True/False for containment, I need to return the actual value from substring_list that matches, to populate df['NEW_COL'].

SUBSTRINGS TO CHECK FOR

substring_list = ['apple', 'banana', 'cherry']

RESULTING DATAFRAME

OLD_COL              NEW_COL
apple pie            apple
black cherry         cherry
banana lemon drop    banana

Please share an example of input and output...

– zipa, Commented May 5, 2017 at 10:44

zipa · Accepted Answer · 2017-05-05 10:53:20Z

3

You haven't said much about what your data looks like or what you want, but the general principle is that you can use:

df['NEW_COL'] = df['COL_TO_CHECK'].apply(lambda x: do_something(x) if is_something(x) else x)

Or in your example:

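# A set gives fast membership tests; note that this matches whole words only.
substring_list = {'apple', 'banana', 'cherry'}
# pop() raises KeyError for rows where no word matches.
df['NEW_COL'] = df['OLD_COL'].apply(lambda x: set(x.split()).intersection(substring_list).pop())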

set is faster :)
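The set-intersection idea above can be made tolerant of rows with no match. The following is only a sketch: the helper name `first_matching_word` and the extra "plain toast" row are made up for illustration, and it still matches whole whitespace-separated words rather than arbitrary substrings.

```python
import pandas as pd

substring_set = {"apple", "banana", "cherry"}

def first_matching_word(text, candidates=substring_set):
    """Return one whitespace-separated word of `text` that is in `candidates`, else None."""
    matches = set(text.split()) & candidates
    # sorted() makes the choice deterministic if several words match
    return sorted(matches)[0] if matches else None

df = pd.DataFrame(
    {"OLD_COL": ["apple pie", "black cherry", "banana lemon drop", "plain toast"]}
)
df["NEW_COL"] = df["OLD_COL"].apply(first_matching_word)
print(df)
```

Rows without a matching word end up with a missing value instead of raising `KeyError`.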

edited May 5, 2017 at 10:53

answered May 5, 2017 at 10:47

zipa


4 Comments

Aran Freel Over a year ago

This solution works, but I don't believe it is the best way to do it; creating a list for every row doesn't seem efficient. df['COL_TO_CHECK'].apply(lambda x: [s for s in substring_list if s in x][0])

Aran Freel Over a year ago

set is the best for containment checks, I knew that, duh :)

Aran Freel Over a year ago

I think that my solution is more flexible

zipa Over a year ago

@MaxU No problem :) +1 back at you
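The one-liner from the comments above, `[s for s in substring_list if s in x][0]`, does true substring matching but raises `IndexError` when nothing matches. A possible variant, assuming a missing value is acceptable for unmatched rows, uses `next()` with a default; the helper name `first_substring` and the sample rows are hypothetical.

```python
import pandas as pd

substring_list = ["apple", "banana", "cherry"]

def first_substring(text, candidates=substring_list):
    # The generator stops at the first candidate contained in `text`;
    # the default None avoids the IndexError of indexing an empty list.
    return next((s for s in candidates if s in text), None)

df = pd.DataFrame(
    {"COL_TO_CHECK": ["apple pie", "black cherry", "pineapple juice", "toast"]}
)
df["NEW_COL"] = df["COL_TO_CHECK"].apply(first_substring)
print(df)
```

Because this is substring containment, "pineapple juice" also yields "apple"; when several candidates match, the one that comes first in `substring_list` wins.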

MaxU - stand with Ukraine · Accepted Answer · 2017-05-05 11:47:56Z

2

I'd do it this way:

In [148]: df
Out[148]:
             OLD_COL
0          apple pie
1       black cherry
2  banana lemon drop

In [149]: pat = '.*({}).*'.format('|'.join(substring_list))

In [150]: pat
Out[150]: '.*(apple|banana|cherry).*'

In [151]: df['NEW_COL'] = df['OLD_COL'].str.replace(pat, r'\1', regex=True)

In [152]: df
Out[152]:
             OLD_COL NEW_COL
0          apple pie   apple
1       black cherry  cherry
2  banana lemon drop  banana
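As an alternative to replacing the whole string with the captured group, `Series.str.extract` can pull the group out directly. This is a sketch rather than the answerer's method: the `re.escape` call and the extra "plain toast" row are additions for illustration.

```python
import re
import pandas as pd

substring_list = ["apple", "banana", "cherry"]

# Escape each candidate so regex metacharacters are matched literally,
# then wrap the alternation in one capture group.
pat = "({})".format("|".join(map(re.escape, substring_list)))

df = pd.DataFrame(
    {"OLD_COL": ["apple pie", "black cherry", "banana lemon drop", "plain toast"]}
)
# expand=False returns a Series when the pattern has a single group
df["NEW_COL"] = df["OLD_COL"].str.extract(pat, expand=False)
print(df)
```

Two behaviours differ from the `str.replace` version: rows without a match become missing values instead of keeping the original string, and the leftmost match is returned, whereas the greedy leading `.*` in `'.*(apple|banana|cherry).*'` captures the last occurrence.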

answered May 5, 2017 at 11:47

MaxU - stand with Ukraine


8 Comments

Aran Freel Over a year ago

what is the r'\1'?

MaxU - stand with Ukraine Over a year ago

r'\1' is a backreference to the first captured RegEx group, so the whole match is replaced by whatever the parentheses captured

Aran Freel Over a year ago

What is your opinion of the method I used above? df['COL_TO_CHECK'].apply(lambda x: [s for s in substring_list if s in x][0])

MaxU - stand with Ukraine Over a year ago

@AranFreel, oh, I didn't notice that solution in the comment. It looks quite good to me...

MaxU - stand with Ukraine Over a year ago

@AranFreel, there are better options if you are searching for words, not for substrings
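The comment does not say which options are meant. One possibility, shown here only as an assumption, is to anchor the pattern with `\b` word boundaries so that candidates match whole words and not fragments of longer words. The sample rows are made up.

```python
import re
import pandas as pd

substring_list = ["apple", "banana", "cherry"]

# \b on both sides restricts matches to whole words
word_pat = r"\b({})\b".format("|".join(map(re.escape, substring_list)))

df = pd.DataFrame({"OLD_COL": ["apple pie", "pineapple juice", "black cherry"]})
df["NEW_COL"] = df["OLD_COL"].str.extract(word_pat, expand=False)
print(df)
```

Here "pineapple juice" gets a missing value, because "apple" appears only inside a longer word.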
