Pandas: Select rows that contain any substring from a list

Question

I would like to select the rows where a column contains any of the substrings in a list. This is what I have so far.

product = ['LID', 'TABLEWARE', 'CUP', 'COVER', 'CONTAINER', 'PACKAGING']

df_plastic_prod = df_plastic[df_plastic['Goods Shipped'].str.contains(product)]

df_plastic_prod.info()

Sample df_plastic

Name          Product
David        PLASTIC BOTTLE
Meghan       PLASTIC COVER
Melanie      PLASTIC CUP 
Aaron        PLASTIC BOWL
Venus        PLASTIC KNIFE
Abigail      PLASTIC CONTAINER
Sophia       PLASTIC LID

Desired df_plastic_prod

Name          Product
Meghan       PLASTIC COVER
Melanie      PLASTIC CUP 
Abigail      PLASTIC CONTAINER
Sophia       PLASTIC LID

Thanks in advance! I appreciate any assistance on this!

jezrael · Accepted Answer · 2020-10-19 08:42:40Z

To match by substrings, join all values of the list with |, which means "or" in a regex, so the pattern matches LID or TABLEWARE ... This solution also works when entries in the list contain 2 or more words:

pat = '|'.join(r"\b{}\b".format(x) for x in product)
df_plastic_prod = df_plastic[df_plastic['Product'].str.contains(pat)]
print(df_plastic_prod)
      Name            Product
1   Meghan      PLASTIC COVER
2  Melanie        PLASTIC CUP
5  Abigail  PLASTIC CONTAINER
6   Sophia        PLASTIC LID
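
For reference, here is a self-contained sketch of the same approach, using the sample data from the question. It makes two additions that are not part of the answer above, both assumptions about what real data might need: re.escape, in case a keyword contains regex special characters, and na=False, in case the column has missing values.

```python
import re
import pandas as pd

product = ['LID', 'TABLEWARE', 'CUP', 'COVER', 'CONTAINER', 'PACKAGING']

df_plastic = pd.DataFrame({
    'Name': ['David', 'Meghan', 'Melanie', 'Aaron', 'Venus', 'Abigail', 'Sophia'],
    'Product': ['PLASTIC BOTTLE', 'PLASTIC COVER', 'PLASTIC CUP', 'PLASTIC BOWL',
                'PLASTIC KNIFE', 'PLASTIC CONTAINER', 'PLASTIC LID'],
})

# Addition (assumption): re.escape keeps characters such as '.' or '+'
# inside a keyword from being interpreted as regex syntax.
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in product)

# Addition (assumption): na=False counts missing values as non-matches,
# so the mask stays purely boolean.
mask = df_plastic['Product'].str.contains(pat, na=False)
df_plastic_prod = df_plastic[mask]
print(df_plastic_prod)
```

If the data is mixed case, passing case=False to str.contains should make the match case-insensitive.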

edited Oct 19, 2020 at 8:42

answered Oct 19, 2020 at 8:33

jezrael


3 Comments

balderman Over a year ago

Question: Why do we need \b in pat?

jezrael Over a year ago

@balderman - \b is called a word boundary. It is needed to match whole words only: it avoids matching cat inside bobcat in "bobcat is nice", while still matching cat in "cat is nice".

balderman Over a year ago

I see. So pat is actually a regexp and the term 'word boundaries' is taken from the regexp domain. Thanks.
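
The effect of \b discussed in these comments can be checked with a small sketch. The strings below are hypothetical: the first two come from the comment, and PLASTIC CUPBOARD is made up for illustration.

```python
import pandas as pd

# Hypothetical strings for illustration only.
s = pd.Series(['bobcat is nice', 'cat is nice',
               'PLASTIC CUPBOARD', 'PLASTIC CUP'])

# Plain substring match: 'cat' is found inside 'bobcat',
# and 'CUP' inside 'CUPBOARD'.
plain = s.str.contains('cat|CUP')

# With word boundaries, only whole words match.
bounded = s.str.contains(r'\bcat\b|\bCUP\b')

print(pd.DataFrame({'text': s, 'plain': plain, 'bounded': bounded}))
```

Without \b every row matches; with \b only "cat is nice" and "PLASTIC CUP" do.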

s3dev · Accepted Answer · 2020-10-19 08:30:52Z

One solution is using regex to parse the 'Product' column, and test if any of the extracted values are in the product list, then filter the original DataFrame on the results.

In this case, a very simple regex pattern, (\w+)$, is used; it matches a single word at the end of the string.

Sample code:

df.iloc[df['Product'].str.extract(r'(\w+)$').isin(product).to_numpy(), :]

Output:

      Name            Product
1   Meghan      PLASTIC COVER
2  Melanie        PLASTIC CUP
5  Abigail  PLASTIC CONTAINER
6   Sophia        PLASTIC LID

Setup:

import pandas as pd

product = ['LID', 'TABLEWARE', 'CUP',
           'COVER', 'CONTAINER', 'PACKAGING']

data = {'Name': ['David', 'Meghan', 'Melanie', 
                 'Aaron', 'Venus', 'Abigail', 'Sophia'],
        'Product': ['PLASTIC BOTTLE', 'PLASTIC COVER', 'PLASTIC CUP', 
                    'PLASTIC BOWL', 'PLASTIC KNIFE', 'PLASTIC CONTAINER',
                    'PLASTIC LID']}
    
df = pd.DataFrame(data)
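
A note on the design choice: because (\w+)$ only captures the last word, this approach selects rows whose final word is in the list. It does not select rows where a keyword appears earlier in the string. The sketch below compares the two behaviours on hypothetical rows that are not from the question's data; expand=False is used here so that str.extract returns a Series.

```python
import pandas as pd

product = ['LID', 'TABLEWARE', 'CUP', 'COVER', 'CONTAINER', 'PACKAGING']

# Hypothetical rows, not from the question's data.
df = pd.DataFrame({'Product': ['PLASTIC CUP', 'PLASTIC CUP HOLDER', 'PLASTIC BOWL']})

# expand=False returns a Series holding the last word of each string.
last_word = df['Product'].str.extract(r'(\w+)$', expand=False)
ends_with_keyword = last_word.isin(product)

# Whole-word search anywhere in the string, for comparison.
pat = '|'.join(r'\b{}\b'.format(x) for x in product)
contains_keyword = df['Product'].str.contains(pat)

print(df.assign(ends_with_keyword=ends_with_keyword,
                contains_keyword=contains_keyword))
```

Here "PLASTIC CUP HOLDER" is matched by the whole-word search but not by the last-word check, so which approach fits depends on whether the keyword must be the final word.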
