6

I would like to select those rows in a column that contains any of the substrings in a list. This is what I have for now.

product = ['LID', 'TABLEWARE', 'CUP', 'COVER', 'CONTAINER', 'PACKAGING']

df_plastic_prod = df_plastic[df_plastic['Goods Shipped'].str.contains(product)]

df_plastic_prod.info()

Sample df_plastic

Name          Product
David        PLASTIC BOTTLE
Meghan       PLASTIC COVER
Melanie      PLASTIC CUP 
Aaron        PLASTIC BOWL
Venus        PLASTIC KNIFE
Abigail      PLASTIC CONTAINER
Sophia       PLASTIC LID

Desired df_plastic_prod

Name          Product
Meghan       PLASTIC COVER
Melanie      PLASTIC CUP 
Abigail      PLASTIC CONTAINER
Sophia       PLASTIC LID

Thanks in advance! I appreciate any assistance on this!

2 Answers 2

6

For match values by subtrings join all values of list by | for regex or - so get values LID or TABLEWARE ...:

Solution working well also with 2 or more words in list.

pat = '|'.join(r"\b{}\b".format(x) for x in product)
df_plastic_prod = df_plastic[df_plastic['Product'].str.contains(pat)]
print (df_plastic_prod)
      Name            Product
1   Meghan      PLASTIC COVER
2  Melanie        PLASTIC CUP
5  Abigail  PLASTIC CONTAINER
6   Sophia        PLASTIC LID
Sign up to request clarification or add additional context in comments.

3 Comments

Question: Why we need to have \b in pat?
@balderman - It is called word boundaries, it is necessary for match by all word/s - avoid match cat in words bobcat is nice, maching cat is nice
I see. So pat is actually a regexp and the term 'word boundaries' is taken from the regexp domain. Thanks.
0

One solution is using regex to parse the 'Product' column, and test if any of the extracted values are in the product list, then filter the original DataFrame on the results.

In this case, a very simple regex pattern is used ((\w+)$) which matches a single word at the end of a line.

Sample code:

df.iloc[df['Product'].str.extract('(\w+)$').isin(product).to_numpy(), :]

Output:

      Name            Product
1   Meghan      PLASTIC COVER
2  Melanie        PLASTIC CUP
5  Abigail  PLASTIC CONTAINER
6   Sophia        PLASTIC LID

Setup:

product = ['LID', 'TABLEWARE', 'CUP', 
           'COVER', 'CONTAINER', 'PACKAGING']

data = {'Name': ['David', 'Meghan', 'Melanie', 
                 'Aaron', 'Venus', 'Abigail', 'Sophia'],
        'Product': ['PLASTIC BOTTLE', 'PLASTIC COVER', 'PLASTIC CUP', 
                    'PLASTIC BOWL', 'PLASTIC KNIFE', 'PLASTIC CONTAINER',
                    'PLASTIC LID']}
    
df = pd.DataFrame(data)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.