
I have a DataFrame:

import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df

   fruit                               values
0  apple                  apple_1_0,peach_1_5
1   pear                             pear_1_3
2  peach  mango_1_0,banana_1_0,pineapple_1_10

The strings in the values column are comma separated, and I'd like to keep only the entries that contain the substring '_1_0'.

Desired output:

   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

Something like this is somewhat close to what I'm trying to do but is painfully slow over ~100,000 rows:

for row in range(len(df)):
    # re-splits the entire column on every iteration, which is what makes this so slow
    print([zero for zero in df['values'].str.split(',', expand=False)[row] if "_1_0" in zero])

['apple_1_0']
[]
['mango_1_0', 'banana_1_0']

4 Answers


Let us try explode:

s = df['values'].str.split(',').explode()  # one row per comma-separated entry; the original index is kept
df['New_values'] = s.where(s.str.endswith('_1_0')).dropna().groupby(level=0).agg(','.join)
df
Out[29]: 
   fruit                               values            New_values
0  apple                  apple_1_0,peach_1_5             apple_1_0
1   pear                             pear_1_3                   NaN
2  peach  mango_1_0,banana_1_0,pineapple_1_10  mango_1_0,banana_1_0
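The question asks for entries that contain the substring '_1_0', while the code above keeps entries that end with it (both give the same result for this data). If a match anywhere in the entry is really what you want, a small variation of the same explode/groupby idea could look like this (a sketch, not part of the original answer):

s = df['values'].str.split(',').explode()
mask = s.str.contains('_1_0', regex=False)                  # literal substring match instead of endswith
df['New_values'] = s[mask].groupby(level=0).agg(','.join)   # rows with no match stay NaN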



Straightforward solution:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'],
     'values': ['apple_1_0,peach_1_5', 'pear_1_3', 'mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

new_data = df['values'].str.split(',')                                            # list of entries per row
new_data = new_data.apply(lambda lst: [elem for elem in lst if '_1_0' in elem])   # keep only matching entries
new_data = new_data.str.join(",")                                                 # back to a comma-separated string
new_data = new_data.replace('', np.nan)                                           # empty string -> NaN
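To put the result back on the DataFrame, you could then assign it to a column (New_values below simply mirrors the first answer and is not part of the original snippet):

df['New_values'] = new_data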



This is an alternative, as a list comprehension:

    df["values"] = [ ",".join(entry if entry.endswith("1_0") 
                              else "" 
                              for entry in val.split(","))
                       .rstrip(",")
                   for val in df["values"]
                   ]

     df = df.replace({"": np.nan})

    df


   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0
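If you would rather keep the original column untouched, the same comprehension can feed a new column instead; the "" or np.nan trick replaces the separate replace step, and the column name New_values here is just an illustration:

df["New_values"] = [",".join(entry for entry in val.split(",")
                             if entry.endswith("_1_0")) or np.nan
                    for val in df["values"]]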



Using findall, you may do this:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(','.join).replace('', np.nan)
print(df)
   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

The regex [^,]*_1_0(?=,|$) matches a run of non-comma characters that ends in _1_0 and is followed by a comma or the end of the string.


We can use a lambda as well:

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(lambda items: ','.join(items) if len(items) > 0 else np.nan)
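To see what the lookahead contributes, the pattern can be tried on a plain string with the re module; kiwi_1_01 is a made-up value added here only to show the rejected case:

import re

print(re.findall(r'[^,]*_1_0(?=,|$)', 'kiwi_1_01,mango_1_0,banana_1_0,pineapple_1_10'))
# ['mango_1_0', 'banana_1_0']
# kiwi_1_01 contains _1_0 but the next character is '1', so the lookahead rejects it;
# pineapple_1_10 never contains _1_0 at all.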

2 Comments

Very nice. Is the word boundary \b necessary in your regex?
I have changed it to [^,]* so that it also matches hyphenated words like banana-fruit_1_0.
