I have two dataframes,

df1,

 Names
 one two three
 Sri is a good player
 Ravi is a mentor
 Kumar is a cricketer player

df2,

 values
 sri
 NaN
 sri, is
 kumar,cricketer player

For each row in df2, I am trying to get the row in df1 that contains all of the items in that df2 value.

My expected output is,

 values                  Names
 sri                     Sri is a good player
 NaN
 sri, is                 Sri is a good player
 kumar,cricketer player  Kumar is a cricketer player

I tried df1["Names"].str.contains("|".join(df2["values"].values.tolist())),

but I cannot get my expected output because the values contain commas (","). Please help.

2 Answers

Use set logic with NumPy broadcasting:

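import numpy as np
import pandas as pd

# Lowercase each string, split it on non-letters, and turn the words into a set.
d1 = df1['Names'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values
d2 = df2['values'].fillna('').str.lower().str.split('[^a-z]+').apply(set).values

# Compare every df2 set (rows) with every df1 set (columns): is the df1 set a superset?
i, j = np.where(d1 >= d2[:, None])

# Attach the matching name (df1 row j) to df2 row i; unmatched rows stay NaN.
df2.assign(Names=pd.Series(df1['Names'].values[j], df2['values'].index[i]))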

                   values                        Names
0                     sri         Sri is a good player
1                     NaN                          NaN
2                 sri, is         Sri is a good player
3  kumar,cricketer player  Kumar is a cricketer player
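
To make the broadcasting step easier to follow, here is a minimal sketch on hand-built sets. The arrays `names` and `wanted` are made up for illustration and stand in for `d1` and `d2`; the only assumption is that NumPy applies `>=` element by element on object arrays, which for Python sets means "is a superset of".

```python
import numpy as np

# Object arrays holding Python sets, filled one element at a time so NumPy
# stores each set as a single item.
names = np.empty(2, dtype=object)
names[0] = {"sri", "is", "a", "good", "player"}
names[1] = {"ravi", "is", "a", "mentor"}

wanted = np.empty(3, dtype=object)
wanted[0] = {"sri"}
wanted[1] = {"is"}
wanted[2] = {"kumar"}

# wanted[:, None] has shape (3, 1) and names has shape (2,), so the comparison
# broadcasts to a (3, 2) grid where cell [i, j] is names[j] >= wanted[i].
grid = (names >= wanted[:, None]).astype(bool)

# Row/column positions of every match: i indexes wanted, j indexes names.
i, j = np.where(grid)
print(grid)
print(list(zip(i.tolist(), j.tolist())))
```

Note that `{"is"}` matches both names, so a value can pair with more than one row; in the question's data each non-empty value matches exactly one name.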

Try -

import pandas as pd

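# Assumes sample.csv holds df1 (column 'names') and sample_2.csv holds df2 (column 'values').
df1 = pd.read_csv('sample.csv')
df2 = pd.read_csv('sample_2.csv')

# Lowercase both columns so the comparison is case-insensitive.
df2['values'] = df2['values'].str.lower()
df1['names'] = df1['names'].str.lower()

# Replace punctuation with spaces, then collapse runs of whitespace.
# regex=True is needed: since pandas 2.0, str.replace treats the pattern as a literal by default.
df2['values'] = df2['values'].str.replace(r'[^\w\s]', ' ', regex=True)
df2['values'] = df2['values'].replace(r'\s+', ' ', regex=True)

df1['names'] = df1['names'].str.replace(r'[^\w\s]', ' ', regex=True)
df1['names'] = df1['names'].replace(r'\s+', ' ', regex=True)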

df2['list_values'] = df2['values'].apply(lambda x: str(x).split())
df1['list_names'] = df1['names'].apply(lambda x: str(x).split())

list_names = df1['list_names'].tolist()

def check_names(x, list_names):
    output = ''
    for list_name in list_names:
        if set(list_name) >= set(x):
            output = ' '.join(list_name)
            break
    return output

df2['Names'] = df2['list_values'].apply(lambda x: check_names(x, list_names))
print(df2)

Output

                   values                        Names
0                     sri         sri is a good player
1                     NaN                             
2                  sri is         sri is a good player
3  kumar cricketer player  kumar is a cricketer player

Explanation

This is a word-level matching problem: a name matches when it contains every word of the value. These are the steps I applied:

  1. Remove punctuation from both dataframes so that only words remain.
  2. Lowercase everything so that matching is case-insensitive.
  3. Split each string into a list of words.
  4. Match each value against the names with the check_names() function, which returns the first name whose words include all words of the value.
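
The steps above can also be sketched without CSV files by building the question's data in memory. This is a variation under stated assumptions, not the code above: `tokenize` and `first_match` are hypothetical helper names, the original casing of `Names` is kept, and unmatched rows get `None` instead of an empty string.

```python
import re
import pandas as pd

# The question's data, typed in by hand.
df1 = pd.DataFrame({"Names": ["Sri is a good player",
                              "Ravi is a mentor",
                              "Kumar is a cricketer player"]})
df2 = pd.DataFrame({"values": ["sri", None, "sri, is", "kumar,cricketer player"]})

def tokenize(text):
    """Return the set of lowercase words in text (empty set for missing values)."""
    if pd.isna(text):
        return set()
    return set(re.findall(r"\w+", str(text).lower()))

# Tokenize every name once, keeping the original string alongside its word set.
name_tokens = [(name, tokenize(name)) for name in df1["Names"]]

def first_match(value):
    """Return the first name containing every word of value, or None."""
    wanted = tokenize(value)
    if not wanted:
        return None
    for name, tokens in name_tokens:
        if tokens >= wanted:  # superset test: all wanted words are present
            return name
    return None

df2["Names"] = df2["values"].apply(first_match)
print(df2)
```

Tokenizing with `re.findall` handles the lowercase, punctuation, and split steps in one pass, which avoids the separate `str.replace` calls.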
