
How can I merge values into one dataframe based on another lookup dataframe?

This is dataframe A, where I want to replace the values:

  InfoType IncidentType DangerType
0      NaN            A        NaN
1      NaN            C        NaN
2      NaN            B          C
3      NaN            B        NaN

This is the lookup table:

   ID     ParamCode ParamValue ParmDesc1 ParamDesc2  SortOrder  ParamStatus
0   1  IncidentType          A       ABC        DEF          1            1
1   2  IncidentType          B       GHI        JKL          2            1
2   3  IncidentType          C       MNO        PQR          7            1
3   4    DangerType          C       STU        VWX          6            1

The expected output:

  InfoType IncidentType DangerType
0      NaN          ABC        NaN
1      NaN          MNO        NaN
2      NaN          GHI        STU
3      NaN          GHI        NaN

Note that ParamCode holds the column names of dataframe A, and I need to replace each value with its corresponding ParmDesc1 in the respective column. Every column in dataframe A may contain NaN and I don't intend to remove them; just ignore them.

This is what I have done :

ntf_cols = ['InfoType','IncidentType','DangerType']
for c in ntf_cols:
    if (c in ntf.columns) & (c in param['ParamCode'].values):
        paramValue = param['ParamValue'].unique()
        for idx, pv in enumerate(paramValue):
            ntf['NewIncidentType'] = pd.np.where(ntf.IncidentType.str.contains(pv), param['ParmDesc1'].values, "whatever")

Error :

ValueError: operands could not be broadcast together with shapes (25,) (13,) ()
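For context, `np.where` (which `pd.np` aliases; that alias is deprecated in modern pandas) broadcasts its three arguments together, so a condition the length of `ntf` cannot be paired with a replacement array the length of `param`. A minimal sketch reproducing the same failure, with made-up lengths standing in for the (25,) and (13,) shapes from the traceback:

```python
import numpy as np
import pandas as pd

# np.where broadcasts condition, x, and y together; mismatched 1-D lengths
# raise the same "operands could not be broadcast together" ValueError.
cond = pd.Series(['A', 'C', 'B', 'B']).str.contains('A')  # shape (4,)
values = np.array(['ABC', 'GHI', 'MNO'])                  # shape (3,)

try:
    np.where(cond, values, 'whatever')
except ValueError as e:
    print(e)  # shapes (4,) (3,) () cannot be broadcast together
```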

  • This is a common duplicate. See, e.g., stackoverflow.com/questions/36413993/… Commented Apr 21, 2020 at 2:47
  • @EricTruett Before this gets voted closed as a duplicate, is there a better matching question? The OP appears to need to reference the values of two columns in the other df, not one. @dee it would improve your question if you provided what you have tried thus far to solve this problem. Commented Apr 21, 2020 at 2:53
  • How is this a duplicate? @EricTruett My question needs to compare a column's name against values within the other dataframe. It is quite different; it is not a comparison between two columns. Commented Apr 21, 2020 at 2:58
  • @Phillyclause89 I tried multiple times but I keep deleting my attempts. I can't wrap my head around it, since the comparison is between column names and values in another dataframe. I am not sure how to map them in this case. Commented Apr 21, 2020 at 3:00
  • 1
    @Phillyclause89 updated with code I've been trying to do. Commented Apr 21, 2020 at 3:26

2 Answers


Use the lookup table to build a dict, then replace the column values of the original dataframe. Assume the original dataframe is df1 and the lookup table is df2:

dict_map = dict(zip(df2.ParamCode + "-" + df2.ParamValue, df2.ParmDesc1))

df1['IncidentType'] = ("IncidentType" + '-' + df1.IncidentType).replace(dict_map)
df1['DangerType'] = ("DangerType" + '-' + df1.DangerType).replace(dict_map)

3 Comments

This seems to work. I have a question: how can I replace df1.IncidentType with a variable rather than hard-coding it? Because I need to apply this to every column. I can show you my code.
Use a for loop over the columns you want to apply it to, something like for col in df1.columns: df1[col] = (col + "-" + df1[col]).replace(dict_map)
If some columns have dtype float64, you should convert them to str first: for col in df1.columns: df1[col] = (col + "-" + df1[col].astype(str)).replace(dict_map)
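Putting this answer and its comments together, a runnable sketch (my own variation, not the answerer's exact code: it uses `.map` plus `.fillna` instead of `.replace`, and the nullable string dtype instead of `astype(str)`, so NaN rows survive the round-trip; the frames are reconstructed from the question):

```python
import pandas as pd

# Data reconstructed from the question (df1 = dataframe A, df2 = lookup table).
df1 = pd.DataFrame({
    'InfoType': [None] * 4,
    'IncidentType': ['A', 'C', 'B', 'B'],
    'DangerType': [None, None, 'C', None],
})
df2 = pd.DataFrame({
    'ParamCode': ['IncidentType'] * 3 + ['DangerType'],
    'ParamValue': ['A', 'B', 'C', 'C'],
    'ParmDesc1': ['ABC', 'GHI', 'MNO', 'STU'],
})

# One dict keyed on "<column name>-<code>" serves every column.
dict_map = dict(zip(df2.ParamCode + '-' + df2.ParamValue, df2.ParmDesc1))

for col in ['IncidentType', 'DangerType']:
    # The nullable string dtype propagates <NA> through concatenation,
    # so NaN rows never produce a bogus key like "DangerType-nan".
    key = col + '-' + df1[col].astype('string')
    # Unmatched (NA) keys map to NaN; fillna restores the original value.
    df1[col] = key.map(dict_map).fillna(df1[col])

print(df1)
```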

EDIT: Lambda's answer gave me an idea for how you could do this for however many columns you want to apply this logical pattern to:

import pandas as pd

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))

df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue  = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))


for col in df1.columns[1:]:
    dict_map = dict(
        df2[df2.ParamCode == col][['ParamValue','ParmDesc1']].to_records(index=False)
    )
    df1[col] = df1[col].replace(dict_map)

print(df1)

This assumes every column after the first in df1 needs updating, and that the to-be-updated column names exist as values in the 'ParamCode' column of df2.



This problem could be solved using some custom functions and pandas.Series.apply():

import pandas as pd

def find_incident_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'IncidentType') & (df2['ParamValue']==x)
    ]["ParmDesc1"].values[0]


def find_danger_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'DangerType') & (df2['ParamValue']==x)
    ]["ParmDesc1"].values[0]


df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))

df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue  = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

df1['IncidentType'] = df1['IncidentType'].apply(find_incident_type)
df1['DangerType'] = df1['DangerType'].apply(find_danger_type)

print(df1)


It is very possible there is a more efficient way to do this; hopefully someone who knows it will share it.

Also, the reference to df2 from the outer scope is hard-coded into the custom functions, so they will only work with that variable name in the outer scope. Use pandas.Series.apply's args parameter if you don't want these functions to depend on that reference.
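For instance, a sketch of that suggestion: the two lookup functions collapse into one hypothetical helper, `find_desc`, which receives the lookup frame and the column name through `apply`'s `args` parameter instead of closing over `df2`:

```python
import pandas as pd

def find_desc(x, lookup, code):
    """Look up ParmDesc1 for value x of column `code` in the lookup frame."""
    if pd.isna(x):
        return x  # leave NaN untouched, as the question requires
    match = lookup[(lookup['ParamCode'] == code) & (lookup['ParamValue'] == x)]
    return match['ParmDesc1'].values[0]

df1 = pd.DataFrame(dict(
    InfoType=[None] * 4,
    IncidentType='A C B B'.split(),
    DangerType=[None, None, 'C', None],
))
df2 = pd.DataFrame(dict(
    ParamCode='IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue='A B C C'.split(),
    ParmDesc1='ABC GHI MNO STU'.split(),
))

for col in ('IncidentType', 'DangerType'):
    # args passes the extra positional arguments after each element x.
    df1[col] = df1[col].apply(find_desc, args=(df2, col))

print(df1)
```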

3 Comments

Thank you very much. I need to look at this soon! But I try to avoid apply because of performance issues. I'm still trying to write my own version to compare with yours. As you stated above, I can't hard-code the columns, as I have other columns to transform as well.
@dee My idea here was to make a custom function for each column in df1 that you need to replace values for, based on specific values from df2 (your lookup table); you then call the pandas.Series.apply method with the respective function for each column that needs values replaced.
Hey, I also tried your second solution and it worked just as expected! Thank you!
