Say we have this df:

import pandas as pd
df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})

    a
0   hair color other family, friends
1   family, friends hair color

I want to extract strings using my own list of items:

items = ['hair color', 'other', 'family, friends']

I want to do this because the raw data has no consistent delimiter or pattern.

Desired output:

import numpy as np
desired_output = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color'],
                                   'hair color': ['hair color', 'hair color'],
                                   'other': ['other', np.nan],
                                   'family, friends': ['family, friends', 'family, friends']
                                  })


                                  a     hair color  other   family, friends
0   hair color other family, friends    hair color  other   family, friends
1   family, friends hair color          hair color  NaN     family, friends

1 Answer

You can craft a regex with one capture group per item (each escaped with re.escape) and use it with str.extractall:

import re

regex = '|'.join([f'({re.escape(i)})' for i in items])
# '(hair\\ color)|(other)|(family,\\ friends)'

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first())

output:

                                   a  hair color  other  family, friends
0  hair color other family, friends   hair color  other  family, friends
1         family, friends hair color  hair color   None  family, friends
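
To see why the groupby step is needed, it may help to look at the intermediate result. The sketch below reuses the df and items from the question; the names matches and collapsed are only illustrative. str.extractall returns one row per match, indexed by the original row and a match number, and only the capture group that matched is filled in on each row.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['hair color other family, friends ', 'family, friends hair color']})
items = ['hair color', 'other', 'family, friends']

regex = '|'.join(f'({re.escape(i)})' for i in items)

# One row per match, with a two-level index: (original row, match number).
# Each row has a value only in the column of the group that matched.
matches = df['a'].str.extractall(regex).set_axis(items, axis=1)
print(matches)

# first() keeps the first non-null value per column within each original row,
# which collapses the matches back to one row per input row.
collapsed = matches.groupby(level=0).first()
print(collapsed)
```

Because the pattern is an alternation, the leftmost alternative that matches at a given position wins. If one item were a prefix of another, putting the longer item first in the list would probably be the safer order.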

update (prefixed column names, and NaN instead of None):

df.join(df['a'].str.extractall(regex)
                   .set_axis(items, axis=1)
                   .groupby(level=0).first()
                   .add_prefix('item1_')
                   .replace({None: np.nan})
       )

output:

                                   a item1_hair color item1_other item1_family, friends
0  hair color other family, friends        hair color       other       family, friends
1         family, friends hair color       hair color         NaN       family, friends
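
If the same extraction has to be repeated for several variables, one option is to wrap it in a small function. This is only a sketch: the function name extract_items, the second column b and the prefix item2_ are made up for illustration and are not part of the question.

```python
import re
import pandas as pd

def extract_items(frame, column, items, prefix):
    """Return one prefixed column per item found in frame[column]."""
    regex = '|'.join(f'({re.escape(i)})' for i in items)
    return (frame[column].str.extractall(regex)
                         .set_axis(items, axis=1)
                         .groupby(level=0).first()
                         .add_prefix(prefix))

df = pd.DataFrame({
    'a': ['hair color other family, friends ', 'family, friends hair color'],
    'b': ['other', 'hair color other'],  # hypothetical second variable
})
items = ['hair color', 'other', 'family, friends']

# The prefixes keep the two sets of new columns from clashing on join.
result = (df.join(extract_items(df, 'a', items, 'item1_'))
            .join(extract_items(df, 'b', items, 'item2_')))
print(result)
```

Items that never occur in a variable still get a column, filled with missing values.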

2 Comments

Amazing. Two questions, if I may: (1) How do I add a prefix to the new column names? For example, item1_hair color, item1_other, etc. I need this because I have to repeat the operation for other variables that have the same items, and your solution would produce columns with the same names. (2) How do I generate np.nan instead of None?
(1) You can use add_prefix('item1_') before joining, (2) .replace({None: np.nan})
