
I have a DataFrame:

import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df

   fruit                               values
0  apple                  apple_1_0,peach_1_5
1   pear                             pear_1_3
2  peach  mango_1_0,banana_1_0,pineapple_1_10

The strings in the values column are comma separated, and I'd like to keep only the entries that contain the substring '_1_0'.

Desired output:

   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

Something like this is somewhat close to what I'm trying to do but is painfully slow over ~100,000 rows:

for row in range(len(df)):
    # re-splits the entire column on every iteration, which is what makes this so slow
    print([zero for zero in df['values'].str.split(',', expand=False)[row] if "_1_0" in zero])

['apple_1_0']
[]
['mango_1_0', 'banana_1_0']

4 Answers


Let us try explode:

s = df['values'].str.split(',').explode()  # one row per comma-separated entry; the original index is kept
df['New_values'] = s.where(s.str.endswith('_1_0')).dropna().groupby(level=0).agg(','.join)
df
Out[29]: 
   fruit                               values            New_values
0  apple                  apple_1_0,peach_1_5             apple_1_0
1   pear                             pear_1_3                   NaN
2  peach  mango_1_0,banana_1_0,pineapple_1_10  mango_1_0,banana_1_0
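The question asks for entries that contain the substring '_1_0', while the code above keeps entries that end with it (both give the same result for this data). If a match anywhere in the entry is really what you want, a small variation of the same explode/groupby idea could look like this (a sketch, not part of the original answer):

s = df['values'].str.split(',').explode()
mask = s.str.contains('_1_0', regex=False)                  # literal substring match instead of endswith
df['New_values'] = s[mask].groupby(level=0).agg(','.join)   # rows with no match stay NaN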



Straightforward solution:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'],
     'values': ['apple_1_0,peach_1_5', 'pear_1_3', 'mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

new_data = df['values'].str.split(',')                                            # list of entries per row
new_data = new_data.apply(lambda lst: [elem for elem in lst if '_1_0' in elem])   # keep only matching entries
new_data = new_data.str.join(",")                                                 # back to a comma-separated string
new_data = new_data.replace('', np.nan)                                           # empty string -> NaN
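To put the result back on the DataFrame, you could then assign it to a column (New_values below simply mirrors the first answer and is not part of the original snippet):

df['New_values'] = new_data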



This is an alternative, as a list comprehension:

    df["values"] = [ ",".join(entry if entry.endswith("1_0") 
                              else "" 
                              for entry in val.split(","))
                       .rstrip(",")
                   for val in df["values"]
                   ]

     df = df.replace({"": np.nan})

    df


   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0
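If you would rather keep the original column untouched, the same comprehension can feed a new column instead; the "" or np.nan trick replaces the separate replace step, and the column name New_values here is just an illustration:

df["New_values"] = [",".join(entry for entry in val.split(",")
                             if entry.endswith("_1_0")) or np.nan
                    for val in df["values"]]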



Using findall, you may do this:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(','.join).replace('', np.nan)
print(df)
   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

The regex [^,]*_1_0(?=,|$) matches a run of non-comma characters that ends in _1_0 and is followed by a comma or the end of the string.


We can use a lambda as well:

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(lambda items: ','.join(items) if len(items) > 0 else np.nan)
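To see what the lookahead contributes, the pattern can be tried on a plain string with the re module; kiwi_1_01 is a made-up value added here only to show the rejected case:

import re

print(re.findall(r'[^,]*_1_0(?=,|$)', 'kiwi_1_01,mango_1_0,banana_1_0,pineapple_1_10'))
# ['mango_1_0', 'banana_1_0']
# kiwi_1_01 contains _1_0 but the next character is '1', so the lookahead rejects it;
# pineapple_1_10 never contains _1_0 at all.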

2 Comments

Very nice. Is the word boundary \b necessary in your regex?
I have changed it to [^,]* so that it also matches hyphenated words like banana-fruit_1_0.
