4
df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], 
                              ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]}) 

and I want to add a column to df1 indicating that the fields overlap with other fields, sample output:

|fields| overlap?|
|------|---------|
|boy   |Y
|apple |N
|toy   |N
|orange|Y
|bear  |N
|eat   |N
|orange|Y
|girl  |Y
|red   |N

first I will explode fields on df1, but I am not sure what the next steps are to check overlap values between 2 dataframes. Thanks!

2
  • 1
    Why is the first orange in df1 a 'Y' for overlap, but the second is not? Commented Sep 15, 2022 at 10:19
  • sorry its a typo Commented Sep 15, 2022 at 13:00

5 Answers 5

3

You can also do it without apply. As you said you can explode, then using isin you can check whether values exist in df2 which will return True / False and then mapping 'Y' / 'N' on that:

df1_exp = df1.explode('fields',ignore_index=True)
df1_exp['overlap'] = df1_exp['fields'].isin(df2['other fields']).map({True:'Y',False:'N'})


   fields overlap
0     boy       Y
1   apple       N
2     toy       N
3  orange       Y
4    bear       N
5     eat       N
6  orange       Y
7    girl       Y
8     red       N
Sign up to request clarification or add additional context in comments.

Comments

2

You can use isin to find overlapping values after exploding both dfs and change the bool to Y/N using np.where

df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]]})

df1 = df1.explode('fields', ignore_index=True)
df1['overlap'] = np.where(df1['fields'].isin(df2['other fields'].explode()), 'Y', 'N')
print(df1)

Output

   fields overlap
0     boy       Y
1   apple       N
2     toy       N
3  orange       Y
4    bear       N
5     eat       N
6  orange       Y
7    girl       Y
8     red       N

Comments

1

this should work

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].apply(lambda x: "Y" if x in df2["other fields"].values else "N")
    fields  overlap
0   boy     Y
0   apple   N
0   toy     N
0   orange  Y
0   bear    N
0   eat     N
1   orange  Y
1   girl    Y
1   red     N

1 Comment

You've got a Y for both entries of orange, only the first has one in the example. OP hasn't clarified if this is desired or a typo
1

You can try .isin():

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].isin(df2["other fields"][0])

You can later replace the True/False with Y/N

Comments

1

Another way is using np.select. I normally use it for huge data sets where some methods may take a while to be executed:

df1 = df1.explode(column='fields')
df1['overlap'] = np.select([df1.fields.isin(df2['other fields'])], ['Y'], 'N')

   index  fields overlap
0      0     boy       Y
1      0   apple       N
2      0     toy       N
3      0  orange       Y
4      0    bear       N
5      0     eat       N
6      1  orange       Y
7      1    girl       Y
8      1     red       N

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.