I have a list of strings and a dictionary. For example:

list = ["apple fell on Newton", "lemon is yellow","grass is greener"]
dict = {"apple" : "fruits", "lemon" : "vegetable"}

The task is to match each string in the list against the keys of the dictionary. If a string contains a key, return that key's value.

Currently I am using the approach below, which is very time-consuming. Can someone please suggest a more efficient technique?

lmb_extract_type = (lambda post: list(filter(None, set(dict.get(w)[0] if w in post.lower().split() else None for w in dict))))

 df['type']  = df[list].apply(lmb_extract_type)
  • I am not sure what the df here is, but for the two inputs you have provided, do check my answer. On a side note, try not to use list and dict as variable names, especially when you are also using list() or dict() for data type conversion :) Commented Feb 5, 2021 at 7:48
  • The number of elements in the list is around 40-50 million, so it is taking a lot of time. Commented Feb 5, 2021 at 7:49
  • Is the list a column in your dataframe? Commented Feb 5, 2021 at 7:49
  • Yes, it's a column in the dataframe. Commented Feb 5, 2021 at 7:50
  • It is a single column with a string (e.g. "apple fell on Newton") in each row of the data frame. For each row, I have to match it against the keys of the dictionary and return the value of the corresponding key. Commented Feb 5, 2021 at 7:53

2 Answers

It is a single column with a string (e.g. "apple fell on Newton") in each row of the data frame. For each row, I have to match it against the keys of the dictionary and return the value of the corresponding key.

The number of elements in the list is around 40-50 million, so it is taking a lot of time.

IIUC, based on your comments, you can solve this with str.extract followed by replace. Both are vectorized pandas methods, so no explicit Python loop is needed.

  1. To use str.extract, build a regex pattern from the keys of the dictionary. This extracts only the matching keyword, here apple or lemon.
  2. Then use the dictionary d to replace each extracted keyword with its corresponding value.
import pandas as pd

l = ["apple fell on Newton", "lemon is yellow", "grass is greener"]
d = {"apple": "fruits", "lemon": "vegetable"}

df = pd.DataFrame(l, columns=['sentences'])  # Single-column dataframe to demonstrate.

pattern = '(' + '|'.join(d.keys()) + ')'  # Regex alternation of the keys: (apple|lemon)
df['type'] = df.sentences.str.extract(pattern).replace(d)  # Extract the key, then swap in its value
print(df)
              sentences       type
0  apple fell on Newton     fruits
1       lemon is yellow  vegetable
2      grass is greener        NaN
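
One caveat with the pattern above: it matches substrings, so a key such as apple would also match inside pineapple, whereas the split()-based code in the question matches whole words only. Below is a minimal sketch of a stricter variant, assuming the keys should match as whole words and case-insensitively (both assumptions are taken from the question's code, not from this answer). The pineapple row and the capitalised Lemon are made-up inputs for illustration.

```python
import re

import pandas as pd

l = ["apple fell on Newton", "Lemon is yellow", "pineapple is sweet"]
d = {"apple": "fruits", "lemon": "vegetable"}

df = pd.DataFrame(l, columns=["sentences"])

# Escape each key so regex metacharacters are treated literally,
# and wrap the alternation in \b so only whole words match.
pattern = r"\b(" + "|".join(re.escape(k) for k in d) + r")\b"

# expand=False returns a Series; lower-casing first makes the match
# case-insensitive while keeping the extracted text equal to a dict key.
matched = df["sentences"].str.lower().str.extract(pattern, expand=False)

# Rows with no match are NaN and stay NaN after the dictionary lookup.
df["type"] = matched.map(d)
print(df)
```

Like the original, this keeps only the first key found in each sentence.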

Alternatively, apply a lambda function and store the matched values as a comma-separated string in the dataframe.

df['New_Col'] = df['sentences'].apply(lambda l: ', '.join([value for key, value in d.items() if key in l]))
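
If scanning every key for each sentence is still too slow, another option is to flip the loop: split each sentence once and look its words up in the dictionary, so the work per row grows with the number of words rather than the number of keys. Below is a rough sketch, assuming whole-word, case-insensitive matching as in the question. extract_types is a hypothetical helper name, not something from the original posts.

```python
import pandas as pd

d = {"apple": "fruits", "lemon": "vegetable"}
df = pd.DataFrame(
    {"sentences": ["apple fell on Newton", "lemon is yellow", "grass is greener"]}
)


def extract_types(sentence, lookup=d):
    """Return the distinct dictionary values for words found in the sentence."""
    found = []
    for word in sentence.lower().split():
        value = lookup.get(word)  # constant-time dictionary lookup per word
        if value is not None and value not in found:
            found.append(value)  # keep first-seen order, skip duplicates
    return found


# Build the column with a plain list comprehension over the sentences.
df["type"] = [extract_types(s) for s in df["sentences"]]
print(df)
```

Unlike the lambda above, this returns a list per row and does not match keys that appear only as part of a longer word.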
