0

I'm having an issue with updating a data frame when I already have values in the column.

Here is an example

import pandas as pd
df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['John', 'Sam',None,None],
                  'id': ['A0', 'A1','A2', 'A3'], }
                     )
df

    df
        Name    email   id
    0   John    [email protected] A0
    1   Sam     [email protected] A1
    2   None    [email protected] A2
    3   None    [email protected] A3

ref_df = pd.DataFrame({
                  'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
                  'Name': ['', 'Sam','Tim','Sara'],
                  'random': ['f', 's','r', 'a'], }
                     )
ref_df
Name           email      random
0           [email protected]     f
1   Sam     [email protected]     s
2   Tim     [email protected]     r
3   Sara    [email protected]     a

The result I want is below:

Name           email    id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

I want to populate the Name with values in ref_df based on email, but keep the existing values. Only update null values in name. Also only keep the original columns in df.(get rid of the random columns in ref_df)

I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.

below is what I have tried, this works if I run the code line by line, but once I wrap it in a function, I got a keyerror.

I'm sure there is a better way for doing this. Any help is appreciated!

def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
              right_on=None,):
    df = init_columns(df, cols=ref_cols)
    cols_to_keep = list(df.columns)
    gap_cols = df.columns.difference(ref_df.columns)
    gap_df = merge(
        df[gap_cols],
        ref_df,
        how,
        left_on,
        right_on,
    )
    gap_df = gap_df[cols_to_keep].set_index(index)
    df = df.set_index(index)
    df.update(gap_df)
    df=df[cols_to_keep]
    return df

2 Answers 2

1

What I did is to convert your ref_df as a dictionary so that we can apply mapping.

ref_dict = dict(zip(ref_df["email"], ref_df["Name"]))
ref_dict

This will give you:

{'[email protected]': 'John',
 '[email protected]': 'Sam',
 '[email protected]': 'Tim',
 '[email protected]': 'Sara'}

Then, you can:

df["Name"] = df["email"].map(ref_dict)

And you will have:

    Name          email id
0   John    [email protected] A0
1   Sam     [email protected] A1
2   Tim     [email protected] A2
3   Sara    [email protected] A3

This will recreate the Name column and if you are afraid that this might change some of the existing values, you can only fill those that are NAs.

Sign up to request clarification or add additional context in comments.

Comments

0

This should work:

df['Name'] = df['Name'].fillna(df['email'].map(ref_df.set_index('email')['Name']))

The way this works is to create an email to Name mapping from ref_df, then use it to fill blanks in your dataframe.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.