6

I have a column of ingredients in a pandas dataframe. I need to remove everything except the name of the ingredient (ex: 1/3 cup cashews > cashews).

Input

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    ⅓ cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    4 dates
2   Truvani Chocolate Turmeric Caramel Cups    1 tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    3 tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    ½ teaspoon vanilla extract

Expected Output

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    almond butter
3   Truvani Chocolate Turmeric Caramel Cups    coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    vanilla extract 

I've tried using a dictionary, with common words mapped to empty strings like so:

remove_list ={'\d+': '', 'ounces': '', 'ounce': '', 'tablespoons': '', 'tablespoon': '', 'teaspoons': '', 'teaspoon': '', 'cup': '', 'cups': ''}
column = df['ingredient']
column.apply(lambda column: [remove_list[y] if y in remove_list else y for y in column])

This didn't change the data at all.

I've also tried using regex:

df['ingredients'] = re.sub(r'|'.join(map(re.escape, remove_list)), '', df['ingredients'])

But that just gives an error saying "TypeError: expected string or buffer."

I'm very new to Python. I think this is possible with regex, but I'm not sure how to do it.

3 Answers

8

Since you want to replace every pattern with the same thing (an empty string), just put the patterns into a list.

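# Raw strings keep the regex escapes intact; 'cups' is listed before 'cup' so the longer word is tried first.
l = [r'\d+', r'[^\x00-\x80]+', 'ounces', 'ounce', 'tablespoons',
     'tablespoon', 'teaspoons', 'teaspoon', 'cups', 'cup']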

Then use one replace, joining everything.

df.ingredient.str.replace('|'.join(l), '', regex=True).str.strip()
# Safer: only replace a token when whitespace follows it; strip is then not needed.
# df.ingredient.str.replace('|'.join([x + r'\s' for x in l]), '', regex=True)

Output:

0            cashews
1              dates
2      almond butter
3       coconut milk
4    vanilla extract
Name: ingredient, dtype: object

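I added '[^\x00-\x80]+' to the list to remove the fraction characters: it matches any run of characters outside the \x00-\x80 range, which includes ⅓ and ½. The .str.strip() call then removes any leading or trailing whitespace left behind by the replacements.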
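
To guard against partial-word matches, one option is to wrap the unit words in `\b` word boundaries. The following is a minimal sketch under assumptions: the sample rows `5 cupcakes` and `2 cups flour` are made up for illustration, and the non-ASCII class uses `\x7F` (the end of the ASCII range) rather than the `\x80` above.

```python
import pandas as pd

# Hypothetical sample data; the last two rows are added to exercise edge cases.
df = pd.DataFrame({"ingredient": [
    "⅓ cup cashews",
    "4 dates",
    "1 tablespoon almond butter",
    "5 cupcakes",
    "2 cups flour",
]})

units = ["ounces", "ounce", "tablespoons", "tablespoon",
         "teaspoons", "teaspoon", "cups", "cup"]

# Digits, non-ASCII characters (the fraction glyphs), or a whole unit word.
# \b on both sides means 'cup' inside 'cupcakes' is left alone.
pattern = r"\d+|[^\x00-\x7F]+|\b(?:" + "|".join(units) + r")\b"

cleaned = df["ingredient"].str.replace(pattern, "", regex=True).str.strip()
print(cleaned.tolist())
```

Because the units are matched as whole words, `cupcakes` survives while `cups` and `cup` are still removed.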


3 Comments

@Conor Just be careful, because this will accidentally turn something like 5 cupcakes into just cakes. You can get around that by making your replacement 'cup\s', so that it only replaces the word when it is followed by whitespace, as in 'cup ' and not 'cupc'.
Ah thank you, I just checked back because I was having that problem.
@ALollz how can I amend it so that it will only replace the word when followed by nothing, i.e. end of string?
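
For the end-of-string case asked about in the last comment, one possible approach is to anchor the word with `$`. This is only a sketch: the sample Series is hypothetical, and the leading `\s*` (which also removes the space before the unit) is an assumption about what is wanted.

```python
import pandas as pd

# Hypothetical sample values for illustration.
s = pd.Series(["1 cup", "cup cashews", "2 cupcakes"])

# \b keeps 'cup' a whole word; $ only matches at the end of the string,
# so 'cup' elsewhere in the string is left untouched.
cleaned = s.str.replace(r"\s*\bcup$", "", regex=True)
print(cleaned.tolist())
```

Only the first value changes, since it is the only one where `cup` is the last word.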
0

There is a set of string methods built into pandas (the .str accessor on a column) for this.

Something like this should work:

df['ingredient'] = df['ingredient'].str.replace(r'\d+', '', regex=True)

I don't know if you can use a dict with it; you may have to iterate through your dictionary to apply all of the replacements that you want.

for ptn, rpl in remove_list.items():
    df['ingredient'] = df['ingredient'].str.replace(ptn, rpl, regex=True)
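
As a self-contained illustration of the loop, here is a sketch with sample data. The assumptions are labeled: the dict is a lightly modified copy of the one in the question (raw-string keys, `cups` placed before `cup` because the replacements run one after another, and an extra non-ASCII pattern for the fraction characters), and the `2 cups flour` row is made up.

```python
import pandas as pd

# Sample rows from the question plus one hypothetical row ('2 cups flour').
df = pd.DataFrame({"ingredient": [
    "⅓ cup cashews",
    "4 dates",
    "1 tablespoon almond butter",
    "3 tablespoons coconut milk",
    "½ teaspoon vanilla extract",
    "2 cups flour",
]})

# Patterns are applied in insertion order, so longer words come first.
remove_list = {
    r"\d+": "", r"[^\x00-\x7F]+": "",
    "ounces": "", "ounce": "",
    "tablespoons": "", "tablespoon": "",
    "teaspoons": "", "teaspoon": "",
    "cups": "", "cup": "",
}

for ptn, rpl in remove_list.items():
    df["ingredient"] = df["ingredient"].str.replace(ptn, rpl, regex=True)

# Remove the whitespace left where quantities and units used to be.
df["ingredient"] = df["ingredient"].str.strip()
print(df["ingredient"].tolist())
```

If `cup` were applied before `cups`, the last row would end up as `s flour`, which is why the order of the keys matters here.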

2 Comments

I was trying to do this without using an individual replace statement for every word but that may be the only way
You can use the dict that you've got to iterate through the patterns with just one extra line of code (as above).
-1

You can use a loop and the .split() method:

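for i, row in df['ingredient'].items():
    # Split once on the first space and keep everything after the quantity.
    item = row.split(sep=' ', maxsplit=1)
    df.loc[i, 'ingredient'] = item[1]  # df.loc avoids chained assignment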

The output will be:

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    teaspoon vanilla extract

If you want to keep the measurements, you can use two columns: keep the quantities in one and the ingredients in the other.
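
A minimal sketch of that two-column idea, assuming the quantity is always the first space-separated token. The column name `quantity` is hypothetical and the sample rows are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({"ingredient": [
    "⅓ cup cashews",
    "4 dates",
    "1 tablespoon almond butter",
]})

# Split once on the first space; expand=True returns a two-column DataFrame.
parts = df["ingredient"].str.split(" ", n=1, expand=True)

df["quantity"] = parts[0]    # hypothetical column holding the leading amount
df["ingredient"] = parts[1]  # everything after the first space
print(df)
```

As with the loop above, the unit words (`cup`, `tablespoon`) stay in the ingredient column, so this only separates the leading number.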
