This seems like it should be an easy problem to solve, but I've been battling with it and cannot find a solution.

I have two dataframes of different sizes and with different column names. I am trying to compare a column in dataframe A with a column in dataframe B, and retain only the rows in dataframe A whose strings are an EXACT match with B. I am not looking for a partial match or substring, but an EXACT FULL STRING match. I have checked close to 40 questions now, and I still keep getting partial matches.

Dataframe A
   ID    Text  score
0   1   admin   10.4
1   2    Care    2.0
2   3  Office    5.0
3   4  Doctor    0.5
4   5      be    0.2
5   6      to    0.9
6   7      in    0.8

And let's assume that the second dataframe is:

Dataframe B
   ID          Labels    Places
0   1          Office     Upper
1   2  administration     Lower
2   3          Doctor  Internal
3   4       Reception     Outer
4   5            Tone     Outer
5   6            Skin  Internal
6   7       Behaviour     Outer
7   8          Injury     Upper

My desired output is

Dataframe A
   ID    Text  score
2   3  Office    5.0
3   4  Doctor    0.5

This will be from comparing DataframeA['Text'] with DataframeB['Labels'] and only keeping the exact matches.

I have tried

df_A_new = df_A[df_A['Text'].isin(df_B['Labels'])]

and the output it gives me is

   ID    Text  score
0   1   admin   10.4
2   3  Office    5.0
3   4  Doctor    0.5
4   5      be    0.2
5   6      to    0.9
6   7      in    0.8

It seems to match the substring admin to administration. Merge has not helped either:

df_A_new = df_A.merge(df_B[['Labels']], left_on='Text', right_on='Labels')[['ID', 'Text', 'score']]

I have checked so many answers on Stack Overflow, but they all seem to be about matching substrings! Can anyone help with this?

  • I tested and your expression df_A_new=(df_A[df_A['Text'].isin(df_B['Labels'])]) is correct and gives the desired result. Commented Sep 30, 2024 at 15:28
  • Thanks @PepeNO. This is a fake dataset for illustration. The actual dataset is larger and gives these substring mappings when I run it. For example, it retains 'be' as one of the matched entries, but that's not present in the second dataset. A search shows things like 'members', 'behaviour', etc, so I think there's some substring matching going on. Commented Sep 30, 2024 at 15:39
  • Then add dummy data that really produces the wrong result you are describing, because your current example doesn't do that. Commented Sep 30, 2024 at 15:47
  • I got the same result as @PepeNO. It would be better if you provided data that reproduces the error. Commented Sep 30, 2024 at 15:51
  • Your new sample data still gives a correct result:

       ID    Text  score
    2   3  Office    5.0
    3   4  Doctor    0.5

    What versions are you running, and where are you loading your data from? Commented Sep 30, 2024 at 16:03
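
The check the commenters describe can be reproduced roughly as follows. This is a minimal sketch built from the sample tables in the question, not from the asker's real dataset; the `suspects` list is an added diagnostic (an assumption, not something from the thread) that prints any label that would equal 'be' once whitespace and case are ignored.

```python
import pandas as pd

# Sample data copied from the question.
df_A = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7],
    'Text': ['admin', 'Care', 'Office', 'Doctor', 'be', 'to', 'in'],
    'score': [10.4, 2.0, 5.0, 0.5, 0.2, 0.9, 0.8],
})
df_B = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Labels': ['Office', 'administration', 'Doctor', 'Reception',
               'Tone', 'Skin', 'Behaviour', 'Injury'],
    'Places': ['Upper', 'Lower', 'Internal', 'Outer',
               'Outer', 'Internal', 'Outer', 'Upper'],
})

# isin() tests whole-value equality, so 'admin' does not match 'administration'.
df_A_new = df_A[df_A['Text'].isin(df_B['Labels'])]
print(df_A_new)

# Diagnostic (illustrative): labels that equal 'be' after stripping and
# lowercasing. repr() makes stray whitespace visible when printed.
suspects = [repr(s) for s in df_B['Labels'] if s.strip().lower() == 'be']
print(suspects)
```

On the real data, a non-empty `suspects` list would suggest that a label really is present in a slightly different form, rather than a substring being matched.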

1 Answer

Here is my suggestion.

Ensure string consistency

  • Remove all hidden characters like extra spaces, tabs, or non-printable characters.
  • Case sensitivity can be another issue. If the match should be case-insensitive, we can normalize all strings to lowercase.
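
As an illustration of the clean-up described above, here is a small sketch. The helper name `normalize_text` and the sample values are hypothetical, and the helper assumes the column contains only strings (no missing values).

```python
import unicodedata
import pandas as pd

def normalize_text(series: pd.Series) -> pd.Series:
    """Hypothetical helper: return a cleaned copy of a string column."""
    return (
        series
        # NFKC folds compatibility characters, e.g. non-breaking space -> space.
        .map(lambda s: unicodedata.normalize('NFKC', s))
        # Collapse tabs, newlines and runs of spaces into a single space.
        .str.replace(r'\s+', ' ', regex=True)
        .str.strip()
        .str.lower()
    )

# Made-up values with a trailing space, a leading tab, a non-breaking space
# and a doubled inner space.
raw = pd.Series(['Office ', '\tDoctor', 'Be\u00a0', 'Ad  min'])
clean = normalize_text(raw)
print(clean.tolist())
```

Applying the same helper to both df_A['Text'] and df_B['Labels'] keeps the two sides comparable. Skip the lowercasing step if the match has to be case-sensitive.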

Use merge() with strict matching

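  • As an alternative to isin(), use merge(), which performs an explicit inner join on exact key equality. Note that isin() also compares whole values, so if unexpected rows survive either method, the cause is usually in the data (stray whitespace, case, duplicates) rather than in the method itself.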
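
One way to see which rows a merge actually matched is the `indicator` option of `DataFrame.merge`. The sketch below uses the small sample frames from this answer. Joining only on the 'Labels' column is a choice made here for illustration, so that df_B's 'ID' column does not collide with df_A's.

```python
import pandas as pd

df_A = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'Text': ['admin', 'Care', 'Office', 'Doctor'],
                     'score': [10.4, 2.0, 5.0, 0.5]})
df_B = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'Labels': ['Office', 'administration', 'Doctor', 'Reception']})

# Join only on the label column so no other df_B column is pulled in.
keys = df_B[['Labels']]

# how='left' keeps every row of df_A; indicator=True adds a '_merge' column
# that is 'both' where a label matched and 'left_only' where none did.
checked = df_A.merge(keys, left_on='Text', right_on='Labels',
                     how='left', indicator=True)

matched = checked.loc[checked['_merge'] == 'both', ['ID', 'Text', 'score']]
print(checked[['Text', '_merge']])
print(matched)
```

Inspecting the 'left_only' rows next to the labels they were expected to match is often the quickest way to spot a formatting difference.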

Optionally check string length

  • If substrings like be are sneaking through, we can additionally compare string lengths, since two strings that match exactly must have the same length.
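
A length check of this kind could be sketched as a sanity assertion after the join. The two small frames are made up for illustration. Because equal strings always have equal length, the assertion should never fail for an exact merge; it only guards against a join that has accidentally become fuzzy.

```python
import pandas as pd

# Made-up data: 'be' is a substring of 'behaviour' but not an exact match.
df_A = pd.DataFrame({'Text': ['be', 'office', 'doctor']})
df_B = pd.DataFrame({'Labels': ['behaviour', 'office', 'doctor']})

merged = df_A.merge(df_B, left_on='Text', right_on='Labels', how='inner')

# Compare the length of each matched pair row by row.
same_length = merged['Text'].str.len() == merged['Labels'].str.len()
assert same_length.all()

print(merged['Text'].tolist())
```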

Here is the code.

import pandas as pd

df_A = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Text': ['admin', 'Care', 'Office', 'Doctor'],
    'score': [10.4, 2.0, 5.0, 0.5]
})

df_B = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Labels': ['Office', 'administration', 'Doctor', 'Reception'],
    'Places': ['Upper', 'Lower', 'Internal', 'Outer']
})

# Strip all leading/trailing spaces and normalize to lowercase (or keep case-sensitive if needed)
df_A['Text'] = df_A['Text'].str.strip().str.lower()
df_B['Labels'] = df_B['Labels'].str.strip().str.lower()

# Optional length pre-filter (if substring matches are suspected):
# keep only the rows of df_A whose text length occurs among the label lengths in df_B
df_A_new = df_A[df_A['Text'].str.len().isin(df_B['Labels'].str.len())]

# Use merge() for strict exact matching. Both frames have an 'ID' column,
# so pandas suffixes them as 'ID_x' (from df_A) and 'ID_y' (from df_B)
df_A_new = df_A_new.merge(df_B, left_on='Text', right_on='Labels', how='inner')[['ID_x', 'Text', 'score']]

# Optionally rename ID column
df_A_new = df_A_new.rename(columns={'ID_x': 'ID'})

# Display the result
print(df_A_new)

I hope this will help you a little.

4 Comments

Thanks. I'll try these. I already tried the lower() and strip(), so I'll try the rest.
OK, let me know if you have any trouble.
Thanks Temunel. This appears to have solved the problem. I did the stripping and lower casing again and then merged and took out duplicates.
Glad to hear that it was helpful. :)
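
The duplicate removal mentioned in the comments can be sketched as follows. The data is hypothetical: it assumes that, after cleaning, the same label appears more than once in df_B, which makes an inner merge repeat the matching df_A row.

```python
import pandas as pd

df_A = pd.DataFrame({'ID': [1, 2],
                     'Text': ['office', 'doctor'],
                     'score': [5.0, 0.5]})
# Hypothetical df_B in which 'office' appears twice after cleaning.
df_B = pd.DataFrame({'Labels': ['office', 'office', 'doctor']})

# Each df_A row is repeated once per matching label, so 'office' comes out twice.
with_dupes = df_A.merge(df_B, left_on='Text', right_on='Labels', how='inner')

# De-duplicating the labels first keeps one output row per df_A row.
unique_labels = df_B[['Labels']].drop_duplicates()
no_dupes = df_A.merge(unique_labels, left_on='Text', right_on='Labels',
                      how='inner')[['ID', 'Text', 'score']]

print(len(with_dupes), len(no_dupes))
```

De-duplicating the key column before the merge, rather than the result afterwards, avoids dropping rows of df_A that are legitimately identical.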
