I'm having an issue with updating a data frame when I already have values in the column.
Here is an example
import pandas as pd
df = pd.DataFrame({
'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'Name': ['John', 'Sam',None,None],
'id': ['A0', 'A1','A2', 'A3'], }
)
df
df
Name email id
0 John [email protected] A0
1 Sam [email protected] A1
2 None [email protected] A2
3 None [email protected] A3
ref_df = pd.DataFrame({
'email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'Name': ['', 'Sam','Tim','Sara'],
'random': ['f', 's','r', 'a'], }
)
ref_df
Name email random
0 [email protected] f
1 Sam [email protected] s
2 Tim [email protected] r
3 Sara [email protected] a
The result I want is below:
Name email id
0 John [email protected] A0
1 Sam [email protected] A1
2 Tim [email protected] A2
3 Sara [email protected] A3
I want to populate the Name with values in ref_df based on email, but keep the existing values. Only update null values in name. Also only keep the original columns in df.(get rid of the random columns in ref_df)
I also want to be able to do this repeatedly, because I want to update df with multiple ref_df from different sources.
below is what I have tried, this works if I run the code line by line, but once I wrap it in a function, I got a keyerror.
I'm sure there is a better way for doing this. Any help is appreciated!
def update_df(df, index, ref_df, ref_cols,how='inner',left_on=None,
right_on=None,):
df = init_columns(df, cols=ref_cols)
cols_to_keep = list(df.columns)
gap_cols = df.columns.difference(ref_df.columns)
gap_df = merge(
df[gap_cols],
ref_df,
how,
left_on,
right_on,
)
gap_df = gap_df[cols_to_keep].set_index(index)
df = df.set_index(index)
df.update(gap_df)
df=df[cols_to_keep]
return df