Merging dataframes based on a column value

Question

I have two large data frames; the two below are just examples of what they look like.

df1 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 1, -1], ['b', 2, 2], ['c', 3, 4]])

node  st1  st2 
 a    1   -1
 b    2    2
 c    3    4

df2 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 8, 5], ['b', 4, 6]])

node  st1  st2
 a    8    5
 b    4    6

I want to update the st1 and st2 column values in df1 with the st1 and st2 values from df2, but only where the node names in both data frames match. Also, if a value in the st1 or st2 column of df1 equals -1, do not update that cell, i.e. keep it as -1. The result would look something like this:

node  st1  st2
 a     8   -1
 b     4    6
 c     3    4

I've tried merging the two data frames using a basic pandas merge with a left join, which gives me a df with duplicate columns, and then looping through each row of the result to check the values of st1 and st2 and replace them only if they are not -1. But this takes a lot of time on larger data frames, which is why I would like to find the most efficient way to do this.

yatu · Accepted Answer · 2019-02-15 15:38:16Z

3

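You could set node as the index in both dataframes, set every value except the -1s to NaN, and use DataFrame.combine_first to fill the NaNs in df1 with the values from df2 that share the same index. The final fillna(df) restores the original df1 values for nodes that are missing from df2: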

df = df1.set_index('node')
df.where(df.eq(-1)).combine_first(df2.set_index('node')).fillna(df)

      st1  st2
node          
a     8.0 -1.0
b     4.0  6.0
c     3.0  4.0

edited Feb 15, 2019 at 15:38

answered Feb 15, 2019 at 15:28

yatu


2 Comments

BallpointBen Over a year ago

This unnecessarily converts to float

yatu Over a year ago

Yes, because the values are converted to NaN at some point, and NaN is treated as a float. The result can easily be converted back to int.
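A minimal sketch of the conversion back to int mentioned above, using the example data from the question. It assumes no NaN is left after fillna, which holds here because every node in df1 has values; the name result is introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 1, -1], ['b', 2, 2], ['c', 3, 4]])
df2 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 8, 5], ['b', 4, 6]])

df = df1.set_index('node')
# Keep only the -1 cells, fill the rest from df2, then fall back to df1
result = df.where(df.eq(-1)).combine_first(df2.set_index('node')).fillna(df)
# Cast back to integers (assumes no NaN remains) and restore node as a column
result = result.astype(int).reset_index()
print(result)
```

If some cells could legitimately stay missing, astype(int) would raise, so the cast is only safe when every cell has been filled.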

busybear · Accepted Answer · 2019-02-15 15:50:29Z

1

One way is to record where -1 appears in df1, merge all the data from df2 into df1, and then restore the -1 values (here I'm actually replacing the non -1 values with the new values). You'll need to set node as the index for this to work:

df1 = df1.set_index('node')
df2 = df2.set_index('node')

no_repl = df1 == -1
new_df = df2.combine_first(df1)
new_df = df1.where(no_repl, new_df).reset_index()

Same idea as @yatu's post really. Just slightly different syntax.
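As a rough check that the two combine_first variants agree, the sketch below runs both on the example data from the question and compares the values. The names a, b, first, second and same are introduced here for illustration; the comparison casts to int first because the two routes may differ in int/float dtype.

```python
import pandas as pd

df1 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 1, -1], ['b', 2, 2], ['c', 3, 4]])
df2 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 8, 5], ['b', 4, 6]])

a = df1.set_index('node')
b = df2.set_index('node')

# yatu's version: blank out everything except -1, fill from df2, fall back to df1
first = a.where(a.eq(-1)).combine_first(b).fillna(a)

# This answer's version: take df2 where available, then restore the -1 cells
no_repl = a == -1
second = a.where(no_repl, b.combine_first(a))

# Compare values only, ignoring any int/float difference between the two routes
same = first.astype(int).equals(second.astype(int))
print(same)
```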

edited Feb 15, 2019 at 15:50

answered Feb 15, 2019 at 15:31

busybear


2 Comments

yatu Over a year ago

This is not merging on node. combine_first will use the indices to do so, but for that node has to be set as the index.

busybear Over a year ago

Good catch. Yes I forgot to set the index. Actually looking at your answer, it's pretty much the same idea +1.

BallpointBen · Accepted Answer · 2019-02-15 15:56:38Z

0

import numpy as np

df3 = df1.set_index('node')
df4 = df2.set_index('node')
# Keep df1's value where it is -1 or where the node is missing from df2
keep_loc = (df3 == -1) | ~df3.index.isin(df4.index)[:, np.newaxis]
df3.where(keep_loc, df4)

      st1  st2
node          
a       8   -1
b       4    6
c       3    4
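To make the mask easier to follow, here is a sketch that builds it step by step on the example data from the question. The intermediate name in_df2 is introduced here for illustration and is not part of the answer above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 1, -1], ['b', 2, 2], ['c', 3, 4]])
df2 = pd.DataFrame(columns=['node', 'st1', 'st2'], data=[['a', 8, 5], ['b', 4, 6]])

df3 = df1.set_index('node')
df4 = df2.set_index('node')

# One boolean per row of df3: is this node also present in df2?
in_df2 = df3.index.isin(df4.index)  # [True, True, False] for a, b, c

# A cell is kept if it holds -1, or if its whole row has no match in df2.
# np.newaxis turns the row flags into a column so they broadcast across st1 and st2.
keep_loc = (df3 == -1) | ~in_df2[:, np.newaxis]

# Where keep_loc is False, take the value from df4 (aligned on node)
result = df3.where(keep_loc, df4)
print(keep_loc)
print(result)
```

Because nothing is set to NaN and filled back in, this route does not rely on a fillna step, although the resulting dtype may still depend on the pandas version.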

answered Feb 15, 2019 at 15:56

BallpointBen
