
I have two dataframes of city data with many rows and a few columns. I am trying to check whether each row of dfA appears in dfB. For the rows of dfA that are in dfB, I want to print the values with their index in one list; for the rows of dfA that are NOT in dfB, another list. The values per row do not have to be in the same column order in the two dataframes, but taken together, each row must contain the same information as a whole. So, for example, dfA Index 1 (New York) would be a match with dfB Index 3, and since dfA has no row for Atlanta but dfB does, Atlanta would be printed in the second list.

For example below:

dfA

Index  Column 1     Column 2  Column 3
0      Albuquerque  NM        87101
1      New York     NY        10009
2      Miami        FL        33101

dfB

Index  Column 1       Column 2     Column 3
0      NM             Albuquerque  87101
1      Atlanta        GA           30033
2      San Francisco  CA           94016
3      10009          NY           New York
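
To reproduce (I typed dfA's zip codes as ints and all of dfB's values as strings, since dfB's columns are mixed):

import pandas as pd

dfA = pd.DataFrame({'Column 1': ['Albuquerque', 'New York', 'Miami'],
                    'Column 2': ['NM', 'NY', 'FL'],
                    'Column 3': [87101, 10009, 33101]})

dfB = pd.DataFrame({'Column 1': ['NM', 'Atlanta', 'San Francisco', '10009'],
                    'Column 2': ['Albuquerque', 'GA', 'CA', 'NY'],
                    'Column 3': ['87101', '30033', '94016', 'New York']})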
  • Show us the .merge() you already attempted, and supply a reprex that we can run. Commented Jan 9 at 17:26
  • Why is dfB inconsistent about the order of the columns? Can you fix that at the source? Commented Jan 10 at 0:54
  • Please provide the exact desired output. Your example for the second list seems to conflict with the description above. Commented Jan 10 at 3:09
  • Barmar - that is the point here. The columns can be mixed or in a different order, as long as all three values are contained per row. Commented Jan 10 at 14:50

5 Answers


You can also do something like this:

dfB = dfB.reset_index(drop=False)  # only if you need the index

common_list = []
uncommon_list = []

# For each row of dfB, check whether its 'Column 2' value (or any other
# column's) exists anywhere in dfA
for i, item in enumerate(dfB['Column 2']):
    rowB = dfB.iloc[i]

    if dfA.isin([item]).any().any():
        common_list.append(rowB.tolist())
    else:
        uncommon_list.append(rowB.tolist())

Output:

print(common_list)

[[0, 'NM', 'Albuquerque', 87101], [3, 10009, 'NY', 'New York']]

print(uncommon_list)

[[1, 'Atlanta', 'GA', 30033], [2, 'San Francisco', 'CA', 94016]]
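
Note that this only tests each dfB row's single 'Column 2' value against dfA. A stricter sketch (an assumption of what whole-row matching could look like, not part of the code above) requires every value of the dfB row to occur somewhere in dfA:

# Stricter variant (sketch): count a dfB row as common only when every one
# of its values occurs somewhere in dfA. Assumes consistent dtypes across
# the frames (e.g. zip codes stored as strings in both); note it still does
# not require the values to come from the same dfA row.
common_list, uncommon_list = [], []
for i, rowB in dfB.iterrows():
    if all(dfA.isin([v]).any().any() for v in rowB):
        common_list.append([i, *rowB.tolist()])
    else:
        uncommon_list.append([i, *rowB.tolist()])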

1 Comment

IIUC, this is O(n*m) where n = len(dfA) and m = len(dfB). I assume Pandas' built-in tools would be faster, like in mozway's answer.

A common way to compare such dfs would be to use df.merge with indicator=True. Problem: dfB is in bad shape. So, let's fix that first.

dfB = (dfB.stack()
       .str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')
       .groupby(level=0)
       .first()
       .astype({0: int})
       .iloc[:, ::-1]
       .set_axis(dfA.columns, axis='columns')
       )

out = dfA.merge(dfB, how='outer', indicator=True)

Output:

        Column 1 Column 2  Column 3      _merge
0    Albuquerque       NM     87101        both # in `dfA` and `dfB`
1        Atlanta       GA     30033  right_only # only in `dfB`
2          Miami       FL     33101   left_only # only in `dfA`
3       New York       NY     10009        both
4  San Francisco       CA     94016  right_only

I.e., now you can use how='inner' to get the rows of dfA that are present in dfB:

dfA.merge(dfB, how='inner')

      Column 1 Column 2  Column 3
0  Albuquerque       NM     87101
1     New York       NY     10009

And how='outer' + df.query to get all rows not in 'both' + df.drop to drop '_merge' afterwards:

(dfA.merge(dfB, how='outer', indicator=True).query('_merge != "both"')
 .drop('_merge', axis=1))

        Column 1 Column 2  Column 3
1        Atlanta       GA     30033
2          Miami       FL     33101
4  San Francisco       CA     94016

Explanation / Intermediates

  • Use df.stack to turn dfB into a Series and apply Series.str.extract to get the zip code (5 digits), state (2 capital letters), and city (anything else) into separate columns.
dfB.stack().str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')

                0    1              2
0 Column 1    NaN   NM            NaN
  Column 2    NaN  NaN    Albuquerque
  Column 3  87101  NaN            NaN
1 Column 1    NaN  NaN        Atlanta
  Column 2    NaN   GA            NaN
  Column 3  30033  NaN            NaN
2 Column 1    NaN  NaN  San Francisco
  Column 2    NaN   CA            NaN
  Column 3  94016  NaN            NaN
3 Column 1  10009  NaN            NaN
  Column 2    NaN   NY            NaN
  Column 3    NaN  NaN       New York
# .groupby(level=0).first()

       0   1              2
0  87101  NM    Albuquerque
1  30033  GA        Atlanta
2  94016  CA  San Francisco
3  10009  NY       New York
  • Apply df.astype to convert the zip codes to int.
  • Use df.iloc to fix the order (city, state, zip code) and df.set_axis to align the column names with dfA.
  • Finally, apply different versions of df.merge.
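
Putting these steps together, the repaired dfB (derived from the intermediates above) looks like:

(dfB.stack()
 .str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')
 .groupby(level=0)
 .first()
 .astype({0: int})
 .iloc[:, ::-1]
 .set_axis(dfA.columns, axis='columns')
 )

        Column 1 Column 2  Column 3
0    Albuquerque       NM     87101
1        Atlanta       GA     30033
2  San Francisco       CA     94016
3       New York       NY     10009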

Data used

import pandas as pd

dataA = {'Column 1': {0: 'Albuquerque', 1: 'New York', 2: 'Miami'}, 
         'Column 2': {0: 'NM', 1: 'NY', 2: 'FL'}, 
         'Column 3': {0: 87101, 1: 10009, 2: 33101}}
dfA = pd.DataFrame(dataA)

dataB = {'Column 1': {0: 'NM', 1: 'Atlanta', 2: 'San Francisco', 3: '10009'}, 
         'Column 2': {0: 'Albuquerque', 1: 'GA', 2: 'CA', 3: 'NY'}, 
         'Column 3': {0: '87101', 1: '30033', 2: '94016', 3: 'New York'}}
dfB = pd.DataFrame(dataB)



You could aggregate each row as a frozenset and use it to create keys to map the indices of dfB.

Prerequisite: let's ensure Index is the index and, if the dtypes are mixed, convert everything to string:

# optional, only if Index is a column
dfA.set_index('Index', inplace=True)
dfB.set_index('Index', inplace=True)

# only if dfA has 87101 (int) and dfB has "87101" (str)
dfA = dfA.astype(str)
dfB = dfB.astype(str)

Then aggregate, create the keys and map:

keyA = dfA.apply(frozenset, axis=1)
keyB = dfB.apply(frozenset, axis=1)

dfA['matching_row'] = keyA.map(pd.Series(dfB.index, index=keyB))

Output:

          Column 1 Column 2 Column 3  matching_row
Index                                             
0      Albuquerque       NM    87101           0.0
1         New York       NY    10009           3.0
2            Miami       FL    33101           NaN

If your goal is just to split dfA without identifying the specific rows from dfB, you could use the keys created above with isin + groupby:

isin_dfB = dict(list(dfA.groupby(keyA.isin(keyB))))

isin_dfB[True]
#           Column 1 Column 2 Column 3
# Index                               
# 0      Albuquerque       NM    87101
# 1         New York       NY    10009

isin_dfB[False]
#       Column 1 Column 2 Column 3
# Index                           
# 2        Miami       FL    33101
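
Note that frozenset ignores multiplicity: two rows containing the same distinct values but with different repetitions would compare equal. If that can happen in your data, a multiset key is a possible alternative (a sketch, not needed for the example above):

from collections import Counter

# hashable multiset key: (value, count) pairs, so repeated values within a
# row are distinguished from single occurrences
keyA = dfA.apply(lambda row: frozenset(Counter(row).items()), axis=1)
keyB = dfB.apply(lambda row: frozenset(Counter(row).items()), axis=1)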



One solution without an explicit for loop:

import numpy as np

resA = dfA[dfA['Column 1'].map(lambda x: x in dfB.values)]

      Column 1 Column 2 Column 3
0  Albuquerque       NM    87101
1     New York       NY    10009

inB = dfA['Column 1'].map(lambda x: np.where(dfB == x)[0])
inB = [x[0] for x in inB if len(x) > 0]
resB = dfB.loc[[x for x in dfB.index if x not in inB]]

        Column 1 Column 2 Column 3
1        Atlanta       GA    30033
2  San Francisco       CA    94016

And if you also want, in this second list, the cities in dfA which are not in dfB:

resAnotB = dfA.loc[[x for x in dfA.index if x not in resA.index]]
resB2 = pd.concat([resB, resAnotB])

        Column 1 Column 2 Column 3
1        Atlanta       GA    30033
2  San Francisco       CA    94016
2          Miami       FL    33101
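
Note that resA matches on 'Column 1' alone, which assumes the city name identifies the row. A sketch (an addition, not part of the code above) that requires all three values of a dfA row to appear somewhere in dfB:

# assumes consistent dtypes across the frames (e.g. zip codes stored as
# strings in both); still does not require the three values to come from
# the same dfB row
mask = dfA.apply(lambda row: all(v in dfB.values for v in row), axis=1)
resA_all = dfA[mask]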



One way of doing this is to define a function that treats each row of dfA as a set and compares it with each row of dfB:

import pandas as pd

data_A = {'Index': [0, 1, 2], 'Column 1': ['Albuquerque', 'New York', 'Miami'], 'Column 2': ['NM', 'NY', 'FL'], 'Column 3': ['87101', '10009', '33101']}
dfA = pd.DataFrame(data_A).set_index('Index')
data_B = {'Index': [0, 1, 2, 3], 'Column 1': ['NM', 'Atlanta', 'San Francisco', '10009'], 'Column 2': ['Albuquerque', 'GA', 'CA', 'NY'], 'Column 3': ['87101', '30033', '94016', 'New York']}
dfB = pd.DataFrame(data_B).set_index('Index')
print(dfA)
print(dfB)

def match_rows(dfA, dfB):
    in_both = []
    not_in_both = []
    for index_a, row_a in dfA.iterrows():
        match_found = False
        for _, row_b in dfB.iterrows():
            if set(row_a) == set(row_b):
                in_both.append((index_a, row_a.to_list()))
                match_found = True
                break
        if not match_found:
            not_in_both.append((index_a, row_a.to_list()))
    
    for index_b, row_b in dfB.iterrows():
        if not any(set(row_b) == set(row_a) for _, row_a in dfA.iterrows()):
            not_in_both.append((index_b, row_b.to_list()))
    
    return in_both, not_in_both

matches, non_matches = match_rows(dfA, dfB)
matches, non_matches

def format_output(matches, non_matches):
    formatted_matches = [
        f"Matching from dfA, Index {index}: {values}"
        for index, values in matches
    ]
    formatted_non_matches = [
        f"Not Matching from {'dfA' if index in dfA.index else 'dfB'}, Index {index}: {values}"
        for index, values in non_matches
    ]
    return formatted_matches, formatted_non_matches

formatted_matches, formatted_non_matches = format_output(matches, non_matches)
formatted_matches, formatted_non_matches

I introduced a second function here to format the output in a more understandable way:

(["Matching from dfA, Index 0: ['Albuquerque', 'NM', '87101']",
  "Matching from dfA, Index 1: ['New York', 'NY', '10009']"],
 ["Not Matching from dfA, Index 2: ['Miami', 'FL', '33101']",
  "Not Matching from dfA, Index 1: ['Atlanta', 'GA', '30033']",
  "Not Matching from dfA, Index 2: ['San Francisco', 'CA', '94016']"])

