Extract strings based on custom list of items

Question

Say we have this df:

import pandas as pd
df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})

    a
0   hair color other family, friends
1   family, friends hair color

I want to extract strings using my own list of items:

items = ['hair color', 'other', 'family, friends']

I want to do this because the raw data has no consistent delimiter or pattern.

Desired output:

import numpy as np
desired_output = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color'],
                                   'hair color': ['hair color', 'hair color'],
                                   'other': ['other', np.nan],
                                   'family, friends': ['family, friends', 'family, friends']
                                  })


                                  a     hair color  other   family, friends
0   hair color other family, friends    hair color  other   family, friends
1   family, friends hair color          hair color  NaN     family, friends

mozway · Accepted Answer · 2022-09-21 11:33:12Z


You can craft a regex with one capture group per item and use it with str.extractall, then collapse the matches back to one row per original row:

import re

regex = '|'.join([f'({re.escape(i)})' for i in items])
# '(hair\\ color)|(other)|(family,\\ friends)'

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first())

output:

                                   a  hair color  other  family, friends
0  hair color other family, friends   hair color  other  family, friends
1         family, friends hair color  hair color   None  family, friends
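
To see why the groupby(level=0).first() step is needed, it can help to look at the intermediate result. The following is a self-contained sketch using the data from the question; the variable names matches and wide are only illustrative and do not appear in the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ',
                         'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

# One capture group per item, so each item ends up in its own column.
regex = '|'.join(f'({re.escape(i)})' for i in items)

# extractall returns one row per match, indexed by (original row, match number).
# In each of those rows only the group that matched is filled in.
matches = df['a'].str.extractall(regex).set_axis(items, axis=1)
print(matches)

# first() keeps the first non-missing value per column for each original row,
# which collapses the matches back to one row per input row.
wide = matches.groupby(level=0).first()
print(wide)
```

Because the first row contains three items and the second contains two, matches has five rows in total, and wide has one row per original row again.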

Update: to add a prefix to the new column names and get NaN instead of None:

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first()
                   .add_prefix('item1_')
                   .replace({None: np.nan})
       )

output:

                                   a item1_hair color item1_other item1_family, friends
0  hair color other family, friends        hair color       other       family, friends
1         family, friends hair color       hair color         NaN       family, friends
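
If the same extraction has to be repeated for several variables, the steps above could be wrapped in a small function. This is only a sketch: extract_items and its prefix parameter are hypothetical names, not part of pandas or of the original answer. It also makes two choices of its own: it reindexes so that rows without any match are kept as all-missing, and it uses where() rather than replace() to normalise missing values to NaN.

```python
import re
import numpy as np
import pandas as pd

def extract_items(series, items, prefix=''):
    """Hypothetical helper: one column per item, NaN where the item is absent."""
    regex = '|'.join(f'({re.escape(i)})' for i in items)
    out = (series.str.extractall(regex)
                 .set_axis(items, axis=1)
                 .groupby(level=0).first()
                 .add_prefix(prefix)
                 # extractall drops rows with no match; bring them back.
                 .reindex(series.index))
    # Replace every missing value (None or NaN) with NaN.
    return out.where(out.notna(), np.nan)

df = pd.DataFrame({'a': ['hair color other family, friends ',
                         'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

result = df.join(extract_items(df['a'], items, prefix='item1_'))
print(result)
```

Calling the function once per variable with a different prefix (for example 'item2_' for the next one) keeps the resulting column names distinct.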

edited Sep 21, 2022 at 11:33

answered Sep 21, 2022 at 11:08


2 Comments

johnjohn Over a year ago

Amazing. Two questions, if I may: (1) How do I add a prefix to the new column names? For example, item1_hair color, item1_other, etc. I need this because I have to repeat the operation for other variables that have the same items, and your solution would produce columns with the same names. (2) How do I generate np.nan instead of None?

mozway Over a year ago

(1) You can use add_prefix('item_') before joining, (2) .replace({None: np.nan})
