I have a dataframe whose str_id column may or may not already contain values (note the index on this dataframe). I need to check this dataframe against another dataframe containing cached data. If exactly one row in the cache matches (based on match_val), I need to copy that row's str_id back into the original dataframe's column without destroying any existing data in that column. If more than one row matches, an error should be recorded against that row in the original dataframe. These requirements exist because, later on, I'll be adding further cache checks that match on other columns (e.g. name) if the first cache check returns nothing.
DF1
      str_id match_val   name
id
3       None     12345    foo
6       None     67890    bar
9       None      None  delta
12  existing      None  leave
CACHE
  str_id match_val     name
0  abcde     12345    alpha
1  edcba     12345     beta
2   ghij     67890    gamma
3   uofd     11111    delta
4   kfsl     11111  epsilon
5    xyz      None    theta
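To make the example reproducible, the two frames above can be built roughly as follows. This is a minimal sketch: it assumes match_val is stored as strings and that missing values are None, neither of which is visible from the printed output.

```python
import pandas

# Assumption: match_val holds strings and missing entries are None.
df1 = pandas.DataFrame(
    {
        "str_id": [None, None, None, "existing"],
        "match_val": ["12345", "67890", None, None],
        "name": ["foo", "bar", "delta", "leave"],
    },
    index=pandas.Index([3, 6, 9, 12], name="id"),
)

# The cache uses the default integer index, as in the printout above.
df_cache = pandas.DataFrame(
    {
        "str_id": ["abcde", "edcba", "ghij", "uofd", "kfsl", "xyz"],
        "match_val": ["12345", "12345", "67890", "11111", "11111", None],
        "name": ["alpha", "beta", "gamma", "delta", "epsilon", "theta"],
    }
)
```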
The desired result is:
DF1
      str_id match_val   name      error
id
3       None     12345    foo  duplicate
6       ghij     67890    bar       None
9       None      None  delta       None
12  existing      None  leave       None
I have the following code so far, which correctly identifies single and multiple matches:
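import pandas

# Keep the original index as a column so it survives the merge.
df1["legacy_id"] = df1.index
# Left-merge only the rows that actually have a match_val on both sides.
merged = pandas.merge(
    left=df1[df1['match_val'].notnull()],
    right=df_cache[df_cache['match_val'].notnull()],
    on='match_val',
    how='left',
    suffixes=['_df', '_cache'],
)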
merge_count = merged.groupby('match_val').size()
errors = merged[merged['match_val'].isin(merge_count[merge_count>1].index)][['legacy_id']]
errors['error'] = 'Duplicate match_val'
errors.set_index('legacy_id', inplace=True)
errors = errors[~errors.index.duplicated()]
matches = merged[merged['match_val'].isin(merge_count[merge_count==1].index)][['legacy_id', 'match_val', 'str_id_cache']]
matches.set_index('legacy_id', inplace=True)
But I can't figure out how to incorporate the data correctly back into the original dataframe. If I assign the columns directly, the assignment destroys the data in the row that already has a str_id (see below). I assume the same would happen in subsequent cache checks, for both the str_id and error columns. So how can I assign the data only for rows whose indexes match? Or is there a completely different way I should be approaching this?
df1['str_id'] = matches['str_id_cache']
df1['error'] = errors['error']
DF1
   str_id match_val   name  legacy_id                error
id
3     NaN     12345    foo          3  Duplicate match_val
6    ghij     67890    bar          6                  NaN
9     NaN      None  delta          9                  NaN
12    NaN      None  leave         12                  NaN
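One possible direction, offered only as a sketch and not a confirmed solution: fill just the missing values with an index-aligned `Series.fillna`, so rows that already hold a value are left alone. The `matches` and `errors` frames below are hand-built stand-ins for the ones computed earlier (an assumption made so the snippet runs on its own), and creating the error column up front is likewise an assumption about how later cache checks would accumulate errors.

```python
import pandas

df1 = pandas.DataFrame(
    {
        "str_id": [None, None, None, "existing"],
        "match_val": ["12345", "67890", None, None],
        "name": ["foo", "bar", "delta", "leave"],
    },
    index=pandas.Index([3, 6, 9, 12], name="id"),
)

# Stand-ins (assumed shape) for the frames produced by the merge step.
matches = pandas.DataFrame(
    {"str_id_cache": ["ghij"]},
    index=pandas.Index([6], name="legacy_id"),
)
errors = pandas.DataFrame(
    {"error": ["Duplicate match_val"]},
    index=pandas.Index([3], name="legacy_id"),
)

# fillna aligns the replacement Series on the index and only touches
# entries that are currently missing, so "existing" at id 12 survives.
df1["str_id"] = df1["str_id"].fillna(matches["str_id_cache"])

# Create the error column once, then fill it the same way so that
# errors from an earlier check are not overwritten by a later one.
if "error" not in df1.columns:
    df1["error"] = None
df1["error"] = df1["error"].fillna(errors["error"])
```

Because only missing entries are filled, repeating the same two `fillna` lines after a second cache check (e.g. one matching on name) should leave earlier results in place.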