Python Pandas-Update a data frame with values from another, without replacing existing

Question

I'm having an issue with updating a data frame when I already have values in the column.

Here is an example

import pandas as pd
df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['John', 'Sam',None,None],
                  'id': ['A0', 'A1','A2', 'A3'], }
                     )
df

    df
        Name    email   id
    0   John    [email protected] A0
    1   Sam     [email protected] A1
    2   None    [email protected] A2
    3   None    [email protected] A3

ref_df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['', 'Sam','Tim','Sara'],
                  'random': ['f', 's','r', 'a'], }
                     )
ref_df
Name           email      random
0           [email protected]     f
1   Sam     [email protected]     s
2   Tim     [email protected]     r
3   Sara    [email protected]     a

The result I want is below:

Name           email    id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

I want to populate the Name with values in ref_df based on email, but keep the existing values. Only update null values in name. Also only keep the original columns in df.(get rid of the random columns in ref_df)

I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.

below is what I have tried, this works if I run the code line by line, but once I wrap it in a function, I got a keyerror.

I'm sure there is a better way for doing this. Any help is appreciated!

def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
              right_on=None,):
    df = init_columns(df, cols=ref_cols)
    cols_to_keep = list(df.columns)
    gap_cols = df.columns.difference(ref_df.columns)
    gap_df = merge(
        df[gap_cols],
        ref_df,
        how,
        left_on,
        right_on,
    )
    gap_df = gap_df[cols_to_keep].set_index(index)
    df = df.set_index(index)
    df.update(gap_df)
    df=df[cols_to_keep]
    return df

Yilun Zhang · Accepted Answer · 2018-02-09 17:04:12Z

1

What I did is to convert your ref_df as a dictionary so that we can apply mapping.

ref_dict = dict(zip(ref_df["email"], ref_df["Name"]))
ref_dict

This will give you:

{'[email protected]': 'John',
 '[email protected]': 'Sam',
 '[email protected]': 'Tim',
 '[email protected]': 'Sara'}

Then, you can:

df["Name"] = df["email"].map(ref_dict)

And you will have:

    Name          email id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

This will recreate the Name column and if you are afraid that this might change some of the existing values, you can only fill those that are NAs.

answered Feb 9, 2018 at 17:04

Yilun Zhang

9,0685 gold badges35 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jpp · Accepted Answer · 2018-02-09 17:09:23Z

0

This should work:

df['Name'] = df['Name'].fillna(df['email'].map(ref_df.set_index('email')['Name']))

The way this works is to create an email to Name mapping from ref_df, then use it to fill blanks in your dataframe.

answered Feb 9, 2018 at 17:09

jpp

166k37 gold badges301 silver badges363 bronze badges

Collectives™ on Stack Overflow

Python Pandas-Update a data frame with values from another, without replacing existing

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related