
I have two dataframes of city data with many rows and a few columns. I am trying to check whether each row of dfA appears in dfB. For the rows of dfA that are in dfB, I want to print the values with their index in one list; for the rows of dfA that are NOT in dfB, another list. The values per row do not have to be in the same column order in the two dataframes, but taken together, each row must contain the same information as a whole. So, for example, dfA Index 1 (New York) would be a match with dfB Index 3, and since dfA has no row for Atlanta but dfB does, Atlanta would be printed in the second list.

For example below:

dfA

Index  Column 1     Column 2  Column 3
0      Albuquerque  NM        87101
1      New York     NY        10009
2      Miami        FL        33101

dfB

Index  Column 1       Column 2     Column 3
0      NM             Albuquerque  87101
1      Atlanta        GA           30033
2      San Francisco  CA           94016
3      10009          NY           New York
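
To reproduce (I typed dfA's zip codes as ints and all of dfB's values as strings, since dfB's columns are mixed):

import pandas as pd

dfA = pd.DataFrame({'Column 1': ['Albuquerque', 'New York', 'Miami'],
                    'Column 2': ['NM', 'NY', 'FL'],
                    'Column 3': [87101, 10009, 33101]})

dfB = pd.DataFrame({'Column 1': ['NM', 'Atlanta', 'San Francisco', '10009'],
                    'Column 2': ['Albuquerque', 'GA', 'CA', 'NY'],
                    'Column 3': ['87101', '30033', '94016', 'New York']})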
  • Show us the .merge() you already attempted, and supply a reprex that we can run. Commented Jan 9 at 17:26
  • Why is dfB inconsistent about the order of the columns? Can you fix that at the source? Commented Jan 10 at 0:54
  • Please provide the exact desired output. Your example for the second list seems to conflict with the description above. Commented Jan 10 at 3:09
  • Barmar - that is the point here. The columns can be mixed or in a different order, as long as all three values are contained per row. Commented Jan 10 at 14:50

5 Answers


You can also do something like this:

dfB = dfB.reset_index(drop=False)  # only if you need the index

common_list = []
uncommon_list = []

# For each row of dfB, check whether its 'Column 2' value (or any other
# column's) exists anywhere in dfA
for i, item in enumerate(dfB['Column 2']):
    rowB = dfB.iloc[i]

    if dfA.isin([item]).any().any():
        common_list.append(rowB.tolist())
    else:
        uncommon_list.append(rowB.tolist())

Output:

print(common_list)

[[0, 'NM', 'Albuquerque', 87101], [3, 10009, 'NY', 'New York']]

print(uncommon_list)

[[1, 'Atlanta', 'GA', 30033], [2, 'San Francisco', 'CA', 94016]]
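
Note that this only tests each dfB row's single 'Column 2' value against dfA. A stricter sketch (an assumption of what whole-row matching could look like, not part of the code above) requires every value of the dfB row to occur somewhere in dfA:

# Stricter variant (sketch): count a dfB row as common only when every one
# of its values occurs somewhere in dfA. Assumes consistent dtypes across
# the frames (e.g. zip codes stored as strings in both); note it still does
# not require the values to come from the same dfA row.
common_list, uncommon_list = [], []
for i, rowB in dfB.iterrows():
    if all(dfA.isin([v]).any().any() for v in rowB):
        common_list.append([i, *rowB.tolist()])
    else:
        uncommon_list.append([i, *rowB.tolist()])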

1 Comment

IIUC, this is O(n*m) where n = len(dfA) and m = len(dfB). I assume Pandas' built-in tools would be faster, like in mozway's answer.

A common way to compare such dfs would be to use df.merge with indicator=True. Problem: dfB is in bad shape. So, let's fix that first.

dfB = (dfB.stack()
       .str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')
       .groupby(level=0)
       .first()
       .astype({0: int})
       .iloc[:, ::-1]
       .set_axis(dfA.columns, axis='columns')
       )

out = dfA.merge(dfB, how='outer', indicator=True)

Output:

        Column 1 Column 2  Column 3      _merge
0    Albuquerque       NM     87101        both # in `dfA` and `dfB`
1        Atlanta       GA     30033  right_only # only in `dfB`
2          Miami       FL     33101   left_only # only in `dfA`
3       New York       NY     10009        both
4  San Francisco       CA     94016  right_only

I.e., now you can use how='inner' to get the rows of dfA that are present in dfB:

dfA.merge(dfB, how='inner')

      Column 1 Column 2  Column 3
0  Albuquerque       NM     87101
1     New York       NY     10009

And how='outer' + df.query to get all rows not in 'both' + df.drop to drop '_merge' afterwards:

(dfA.merge(dfB, how='outer', indicator=True).query('_merge != "both"')
 .drop('_merge', axis=1))

        Column 1 Column 2  Column 3
1        Atlanta       GA     30033
2          Miami       FL     33101
4  San Francisco       CA     94016

Explanation / Intermediates

  • Use df.stack to turn dfB into a Series and apply Series.str.extract to get the zip code (5 digits), state (2 capital letters), and city (anything else) into separate columns.
dfB.stack().str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')

                0    1              2
0 Column 1    NaN   NM            NaN
  Column 2    NaN  NaN    Albuquerque
  Column 3  87101  NaN            NaN
1 Column 1    NaN  NaN        Atlanta
  Column 2    NaN   GA            NaN
  Column 3  30033  NaN            NaN
2 Column 1    NaN  NaN  San Francisco
  Column 2    NaN   CA            NaN
  Column 3  94016  NaN            NaN
3 Column 1  10009  NaN            NaN
  Column 2    NaN   NY            NaN
  Column 3    NaN  NaN       New York
# .groupby(level=0).first()

       0   1              2
0  87101  NM    Albuquerque
1  30033  GA        Atlanta
2  94016  CA  San Francisco
3  10009  NY       New York
  • Apply df.astype to convert the zip codes to int.
  • Use df.iloc to fix the order (city, state, zip code) and df.set_axis to align the column names with dfA.
  • Finally, apply different versions of df.merge.
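
Putting these steps together, the repaired dfB (derived from the intermediates above) looks like:

(dfB.stack()
 .str.extract(r'(^\d{5}$)|(^[A-Z]{2}$)|(.+)')
 .groupby(level=0)
 .first()
 .astype({0: int})
 .iloc[:, ::-1]
 .set_axis(dfA.columns, axis='columns')
 )

        Column 1 Column 2  Column 3
0    Albuquerque       NM     87101
1        Atlanta       GA     30033
2  San Francisco       CA     94016
3       New York       NY     10009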

Data used

import pandas as pd

dataA = {'Column 1': {0: 'Albuquerque', 1: 'New York', 2: 'Miami'}, 
         'Column 2': {0: 'NM', 1: 'NY', 2: 'FL'}, 
         'Column 3': {0: 87101, 1: 10009, 2: 33101}}
dfA = pd.DataFrame(dataA)

dataB = {'Column 1': {0: 'NM', 1: 'Atlanta', 2: 'San Francisco', 3: '10009'}, 
         'Column 2': {0: 'Albuquerque', 1: 'GA', 2: 'CA', 3: 'NY'}, 
         'Column 3': {0: '87101', 1: '30033', 2: '94016', 3: 'New York'}}
dfB = pd.DataFrame(dataB)



You could aggregate each row as a frozenset and use it to create keys to map the indices of dfB.

Prerequisite: let's ensure Index is the index and, if the dtypes are mixed, convert everything to string:

# optional, only if Index is a column
dfA.set_index('Index', inplace=True)
dfB.set_index('Index', inplace=True)

# only if dfA has 87101 (int) and dfB has "87101" (str)
dfA = dfA.astype(str)
dfB = dfB.astype(str)

Then aggregate, create the keys and map:

keyA = dfA.apply(frozenset, axis=1)
keyB = dfB.apply(frozenset, axis=1)

dfA['matching_row'] = keyA.map(pd.Series(dfB.index, index=keyB))

Output:

          Column 1 Column 2 Column 3  matching_row
Index                                             
0      Albuquerque       NM    87101           0.0
1         New York       NY    10009           3.0
2            Miami       FL    33101           NaN

If your goal is just to split dfA without identifying the specific rows from dfB, you could use the keys created above with isin + groupby:

isin_dfB = dict(list(dfA.groupby(keyA.isin(keyB))))

isin_dfB[True]
#           Column 1 Column 2 Column 3
# Index                               
# 0      Albuquerque       NM    87101
# 1         New York       NY    10009

isin_dfB[False]
#       Column 1 Column 2 Column 3
# Index                           
# 2        Miami       FL    33101
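
Note that frozenset ignores multiplicity: two rows containing the same distinct values but with different repetitions would compare equal. If that can happen in your data, a multiset key is a possible alternative (a sketch, not needed for the example above):

from collections import Counter

# hashable multiset key: (value, count) pairs, so repeated values within a
# row are distinguished from single occurrences
keyA = dfA.apply(lambda row: frozenset(Counter(row).items()), axis=1)
keyB = dfB.apply(lambda row: frozenset(Counter(row).items()), axis=1)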



One solution without an explicit for loop:

import numpy as np

resA = dfA[dfA['Column 1'].map(lambda x: x in dfB.values)]

      Column 1 Column 2 Column 3
0  Albuquerque       NM    87101
1     New York       NY    10009

inB = dfA['Column 1'].map(lambda x: np.where(dfB == x)[0])
inB = [x[0] for x in inB if len(x) > 0]
resB = dfB.loc[[x for x in dfB.index if x not in inB]]

        Column 1 Column 2 Column 3
1        Atlanta       GA    30033
2  San Francisco       CA    94016

And if you also want, in this second list, the cities in dfA which are not in dfB:

resAnotB = dfA.loc[[x for x in dfA.index if x not in resA.index]]
resB2 = pd.concat([resB, resAnotB])

        Column 1 Column 2 Column 3
1        Atlanta       GA    30033
2  San Francisco       CA    94016
2          Miami       FL    33101
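
Note that resA matches on 'Column 1' alone, which assumes the city name identifies the row. A sketch (an addition, not part of the code above) that requires all three values of a dfA row to appear somewhere in dfB:

# assumes consistent dtypes across the frames (e.g. zip codes stored as
# strings in both); still does not require the three values to come from
# the same dfB row
mask = dfA.apply(lambda row: all(v in dfB.values for v in row), axis=1)
resA_all = dfA[mask]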



One way of doing this is to define a function that treats each row of dfA as a set and compares it with each row of dfB:

import pandas as pd

data_A = {'Index': [0, 1, 2], 'Column 1': ['Albuquerque', 'New York', 'Miami'], 'Column 2': ['NM', 'NY', 'FL'], 'Column 3': ['87101', '10009', '33101']}
dfA = pd.DataFrame(data_A).set_index('Index')
data_B = {'Index': [0, 1, 2, 3], 'Column 1': ['NM', 'Atlanta', 'San Francisco', '10009'], 'Column 2': ['Albuquerque', 'GA', 'CA', 'NY'], 'Column 3': ['87101', '30033', '94016', 'New York']}
dfB = pd.DataFrame(data_B).set_index('Index')
print(dfA)
print(dfB)

def match_rows(dfA, dfB):
    in_both = []
    not_in_both = []
    for index_a, row_a in dfA.iterrows():
        match_found = False
        for _, row_b in dfB.iterrows():
            if set(row_a) == set(row_b):
                in_both.append((index_a, row_a.to_list()))
                match_found = True
                break
        if not match_found:
            not_in_both.append((index_a, row_a.to_list()))
    
    for index_b, row_b in dfB.iterrows():
        if not any(set(row_b) == set(row_a) for _, row_a in dfA.iterrows()):
            not_in_both.append((index_b, row_b.to_list()))
    
    return in_both, not_in_both

matches, non_matches = match_rows(dfA, dfB)
matches, non_matches

def format_output(matches, non_matches):
    formatted_matches = [
        f"Matching from dfA, Index {index}: {values}"
        for index, values in matches
    ]
    formatted_non_matches = [
        f"Not Matching from {'dfA' if index in dfA.index else 'dfB'}, Index {index}: {values}"
        for index, values in non_matches
    ]
    return formatted_matches, formatted_non_matches

formatted_matches, formatted_non_matches = format_output(matches, non_matches)
formatted_matches, formatted_non_matches

I introduced a second function here to format the output in a more understandable way:

(["Matching from dfA, Index 0: ['Albuquerque', 'NM', '87101']",
  "Matching from dfA, Index 1: ['New York', 'NY', '10009']"],
 ["Not Matching from dfA, Index 2: ['Miami', 'FL', '33101']",
  "Not Matching from dfA, Index 1: ['Atlanta', 'GA', '30033']",
  "Not Matching from dfA, Index 2: ['San Francisco', 'CA', '94016']"])

