Replacing text with dictionary keys (having multiple values) in Python - more efficiency

Question

I have been trying to replace parts of the text in a pandas DataFrame column with keys from a dictionary, based on the multiple values stored under each key. I get the desired result, but the loop is very slow on a large dataset. I would appreciate advice on a more 'Pythonic' or more efficient way of achieving this. Please see the example below:

import pandas as pd

df = pd.DataFrame({'Dish': ['A', 'B', 'C'],
                   'Price': [15, 8, 20],
                   'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']})

Dish	Price	Ingredient
A	15	apple banana apricot lamb
B	8	wheat pork venison
C	20	orange lamb guinea

The dictionary is below:

CountryList = {'FRUIT': [['apple'], ['orange'],  ['banana']],
 'CEREAL': [['oat'], ['wheat'],  ['corn']],
 'MEAT': [['chicken'],  ['lamb'],  ['pork'],  ['turkey'], ['duck']]}

I am trying to replace text in the 'Ingredient' column with the dictionary key whose values contain that text. For example, 'apple' in the first row would be replaced by the dictionary key 'FRUIT'. The desired table is shown below:

Dish	Price	Ingredient
A	15	FRUIT FRUIT apricot MEAT
B	8	CEREAL MEAT venison
C	20	FRUIT MEAT guinea

I have seen some related questions here where each key has one value, but in this case there are multiple values for any given key in the dictionary. So far, I have been able to achieve the desired result, but it is painfully slow when working with a large dataset. The code I have used so far is shown below:

countries = list(CountryList.keys())

for country in countries:
    for i in range(len(CountryList[country])):
        lender = CountryList[country][i]
        country = str(country)
        lender = str(lender).replace("['",'',).replace("']",'')
        df['Ingredient'] = df['Ingredient'].str.replace(lender,country)

Perhaps this could benefit from multiprocessing? Needless to say, my knowledge of Python leaves a lot to be desired.

Any suggestion to speed up the process would be highly appreciated.

Thanks in advance.

Edit: just to add, there are about 200 keys in the dictionary and some keys have more than 60000 values, which makes the code very inefficient time-wise.

Can the format of CountryList be changed? Do you really need lists of one element?
– Corralien, Commented Jun 13, 2021 at 14:45
Why name variables CountryList, lender, country when the domain is ingredients? It makes the code harder to follow.
– DarrylG, Commented Jun 13, 2021 at 14:47

Corralien · Answer · 2021-06-13 14:54:20Z

4

Change the format of CountryList:

import itertools

CountryList2 = {}
for k, v in CountryList.items():
    for i in (itertools.chain.from_iterable(v)):
        CountryList2[i] = k

>>> CountryList2
{'apple': 'FRUIT',
 'orange': 'FRUIT',
 'banana': 'FRUIT',
 'oat': 'CEREAL',
 'wheat': 'CEREAL',
 'corn': 'CEREAL',
 'chicken': 'MEAT',
 'lamb': 'MEAT',
 'pork': 'MEAT',
 'turkey': 'MEAT',
 'duck': 'MEAT'}

Now you can use replace:

df['Ingredient'] = df['Ingredient'].replace(CountryList2, regex=True)

>>> df
  Dish  Price                 Ingredient
0    A     15   FRUIT FRUIT apricot MEAT
1    B      8        CEREAL MEAT venison
2    C     20          FRUIT MEAT guinea
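One caveat with regex=True is that the dictionary keys are treated as patterns and match anywhere in the string, so 'oat' would also be rewritten inside 'goat'. A minimal sketch of one way to guard against that, assuming whole-word matching is what you want: wrap each key in \b word boundaries and escape it first. The names bounded and s below are only for illustration.

```python
import re
import pandas as pd

# Small reverse lookup in the same shape as CountryList2 (illustrative subset).
CountryList2 = {'apple': 'FRUIT', 'oat': 'CEREAL'}

# Escape each key and wrap it in word boundaries so only whole words match.
bounded = {r'\b' + re.escape(word) + r'\b': label
           for word, label in CountryList2.items()}

s = pd.Series(['goat apple', 'oat pineapple'])
result = s.replace(bounded, regex=True)
print(result.tolist())  # ['goat FRUIT', 'CEREAL pineapple']
```

This only changes what gets matched; it still tries every key as a separate pattern, so it is unlikely to help with speed on a large dictionary.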

answered Jun 13, 2021 at 14:54

1 Comment

bibekb Over a year ago

Thanks for the effort; just tested this and it works. However, for a reduced dataframe of 20000 rows it took 242 seconds, whereas the initial code I used takes 166 seconds. I have about a million rows to process, hence I am looking for a more efficient way. Perhaps I should have mentioned that some keys in the dictionary have more than 60000 values.

tdelaney · Accepted Answer · 2021-06-13 15:45:45Z

2

You can build a reverse index from product to type by creating a dictionary whose keys are the values in the sublists:

product_to_type = {}
for typ, product_lists in CountryList.items():
    for product_list in product_lists:
        for product in product_list:
            product_to_type[product] = typ

A little Python magic lets you compress this step into a dict comprehension:

product_to_type = {product:typ for typ, product_lists in CountryList.items()
   for product_list in product_lists for product in product_list}
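Note that in a reverse index like this, a product listed under two types silently keeps only the last type seen. If that could happen in your data, a quick check along these lines might be worth running first. This is just a sketch; find_conflicts and the sample dictionary are made up for illustration.

```python
from collections import defaultdict

def find_conflicts(type_to_products):
    """Return products that appear under more than one type."""
    seen = defaultdict(set)
    for typ, product_lists in type_to_products.items():
        for product_list in product_lists:
            for product in product_list:
                seen[product].add(typ)
    return {product: sorted(types)
            for product, types in seen.items() if len(types) > 1}

# Hypothetical data where 'tomato' is listed twice.
sample = {'FRUIT': [['apple'], ['tomato']],
          'VEGETABLE': [['tomato'], ['leek']]}
conflicts = find_conflicts(sample)
print(conflicts)  # {'tomato': ['FRUIT', 'VEGETABLE']}
```

An empty result means every product maps to exactly one type, so the comprehension loses nothing.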

Then you can create a function that splits the ingredients and maps them to type and apply that to the dataframe.

import pandas as pd

CountryList = {'FRUIT': [['apple'], ['orange'],  ['banana']],
 'CEREAL': [['oat'], ['wheat'],  ['corn']],
 'MEAT': [['chicken'],  ['lamb'],  ['pork'],  ['turkey'], ['duck']]}

product_to_type = {product:typ for typ, product_lists in CountryList.items()
   for product_list in product_lists for product in product_list}

def convert_product_to_type(products):
    return " ".join(product_to_type.get(product, product) 
        for product in products.split(" "))
    
df =  pd.DataFrame({'Dish':  ['A', 'B','C'],
        'Price': [15,8,20],
         'Ingredient': ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
        })

df["Ingredient"] = df["Ingredient"].apply(convert_product_to_type)

print(df)

Note: This solution splits the ingredient string on single spaces, which assumes that the ingredients themselves don't contain spaces.
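On the splitting itself: split(" ") keeps empty tokens, so doubled or trailing spaces in the input survive in the output, while split() with no argument collapses runs of whitespace. Which one you want depends on whether the original spacing matters. A small sketch of the difference, with two illustrative function names and a cut-down lookup table:

```python
# Cut-down reverse index, in the same shape as product_to_type.
product_to_type = {'apple': 'FRUIT', 'lamb': 'MEAT'}

def convert_keep_spacing(products):
    # split(" ") yields '' for every extra space, so spacing is preserved.
    return " ".join(product_to_type.get(p, p) for p in products.split(" "))

def convert_normalized(products):
    # split() drops leading/trailing whitespace and collapses repeats.
    return " ".join(product_to_type.get(p, p) for p in products.split())

text = 'apple  lamb '
kept = convert_keep_spacing(text)        # 'FRUIT  MEAT '
normalized = convert_normalized(text)    # 'FRUIT MEAT'
print(repr(kept), repr(normalized))
```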

edited Jun 13, 2021 at 15:45

answered Jun 13, 2021 at 14:58

3 Comments

bibekb Over a year ago

Thanks for the effort; this worked like a charm. For the pared data, it brought down the time from 166 secs to 0.10 seconds. Remarkable indeed, perhaps the 'apply' function did the magic. What my code could not do in 8+ hours was done in 13.14 seconds for the full data using your code. Also grateful to you for explaining the first line of code in easy-to-understand way.

Corralien Over a year ago

@realbibek. You should accept the answer of tdelaney, please.

bibekb Over a year ago

just realised and accepted it now. cheers all

ThePyGuy · Answer · 2021-06-13 15:50:47Z

0

If you want to use regex, join all the values in CountryList for each key with a pipe |, then call Series.str.replace once per key. It will be way faster than the approach you are using.

joined={key: '|'.join(item[0] for item in value) for key,value in CountryList.items()}

for key in joined:
    df['Ingredient'] = df['Ingredient'].str.replace(joined[key], key, regex=True)

OUTPUT:

  Dish  Price                 Ingredient
0    A     15  FRUIT FRUIT apricot MEAT 
1    B      8        CEREAL MEAT venison
2    C     20          FRUIT MEAT guinea
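A possible variation on the regex idea, shown here only as a sketch: build one alternation over all values, escaped and wrapped in \b word boundaries, and look the replacement up in a callback so the text is scanned once instead of once per key. The names value_to_key, pattern and replace_all are illustrative, and I have not timed this against a dictionary with tens of thousands of values per key.

```python
import re

CountryList = {'FRUIT': [['apple'], ['orange'], ['banana']],
               'CEREAL': [['oat'], ['wheat'], ['corn']],
               'MEAT': [['chicken'], ['lamb'], ['pork'], ['turkey'], ['duck']]}

# Reverse lookup: value -> key.
value_to_key = {item[0]: key for key, value in CountryList.items() for item in value}

# Longest alternatives first, so a longer value is tried before a shorter prefix.
alternatives = sorted(value_to_key, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(a) for a in alternatives) + r')\b')

def replace_all(text):
    # The callback maps each matched word to its dictionary key.
    return pattern.sub(lambda m: value_to_key[m.group(0)], text)

rows = ['apple banana apricot lamb ', 'wheat pork venison', 'orange lamb guinea']
converted = [replace_all(row) for row in rows]
print(converted)
```

With a DataFrame, the same function could be used as df['Ingredient'].apply(replace_all).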

Another approach is to reverse the keys and values of the dictionary, split the Ingredient column into words, and use dict.get on each word with the word itself as the default:

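reversedCountries = {item[0]: key for key, value in CountryList.items() for item in value}

# Assign the result back; apply returns a new Series and does not modify df in place.
df['Ingredient'] = df['Ingredient'].apply(lambda x: ' '.join(reversedCountries.get(y, y) for y in x.split()))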

edited Jun 13, 2021 at 15:50

answered Jun 13, 2021 at 15:32

1 Comment

bibekb Over a year ago

Thanks for the effort; it brought substantial efficiency gains: time was reduced from 166 seconds to 91 seconds for the reduced data. However, the code provided by @tdelaney lowered the time by orders of magnitude (from 166 secs to 0.1 sec).
