I am trying to merge several new DataFrames into a main one. Suppose the main DataFrame (test) is:

      key1           key2
0   0.365803    0.259112
1   0.086869    0.589834
2   0.269619    0.183644
3   0.755826    0.045187
4   0.204009    0.669371

I want to merge the following two datasets into the main one.
New data1:

        key1    key2    new feature
0   0.365803    0.259112    info1

New data2:

        key1    key2    new feature
0   0.204009    0.669371    info2

Expected result:

       key1       key2  new feature
0   0.365803    0.259112    info1
1   0.086869    0.589834    NaN
2   0.269619    0.183644    NaN
3   0.755826    0.045187    NaN
4   0.204009    0.669371    info2

What I tried:

test = test.merge(data1, left_on=['key1', 'key2'], right_on=['key1', 'key2'], how='left')
test = test.merge(data2, left_on=['key1', 'key2'], right_on=['key1', 'key2'], how='left')

This works for the first merge but not for the second. The result I get is:

        key1    key2    new feature_x   new feature_y
0   0.365803    0.259112    info1      NaN
1   0.086869    0.589834    NaN        NaN
2   0.269619    0.183644    NaN        NaN
3   0.755826    0.045187    NaN        NaN
4   0.204009    0.669371    NaN       info2
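
For reference, a minimal sketch that reproduces this, assuming the frames are built from the values shown above (the names step1 and step2 are only illustrative). Both new frames carry a non-key column called "new feature", so the second merge finds a name clash and falls back to the default _x and _y suffixes:

```python
import pandas as pd

# Main frame and the two new frames, using the values from the question.
test = pd.DataFrame({
    "key1": [0.365803, 0.086869, 0.269619, 0.755826, 0.204009],
    "key2": [0.259112, 0.589834, 0.183644, 0.045187, 0.669371],
})
data1 = pd.DataFrame({"key1": [0.365803], "key2": [0.259112], "new feature": ["info1"]})
data2 = pd.DataFrame({"key1": [0.204009], "key2": [0.669371], "new feature": ["info2"]})

# First merge: "new feature" does not exist on the left yet, so no suffix is needed.
step1 = test.merge(data1, on=["key1", "key2"], how="left")

# Second merge: "new feature" now exists on both sides, so pandas renames
# the two copies instead of combining them.
step2 = step1.merge(data2, on=["key1", "key2"], how="left")

print(step2.columns.tolist())
# ['key1', 'key2', 'new feature_x', 'new feature_y']
```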

Thanks for your help!

3 Answers

First combine the new DataFrames into one with concat (or append), then merge once:

dat = pd.concat([data1, data2], ignore_index=True)

Or, on pandas versions before 2.0 only (DataFrame.append was removed in 2.0):

dat = data1.append(data2, ignore_index=True)

print (dat)
       key1      key2 new feature
0  0.365803  0.259112       info1
1  0.204009  0.669371       info2

# if the join columns have the same names in both frames, the single `on` parameter is enough
df = test.merge(dat, on=['key1', 'key2'], how='left')

print (df)
       key1      key2 new feature
0  0.365803  0.259112       info1
1  0.086869  0.589834         NaN
2  0.269619  0.183644         NaN
3  0.755826  0.045187         NaN
4  0.204009  0.669371       info2
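
Put together, the approach above might look like the following self-contained sketch. It assumes the frames from the question; the list name new_frames is hypothetical and stands for however many new DataFrames there are. The validate argument is an optional extra, not part of the answer above: it makes the merge raise if the combined new data contains the same key pair twice.

```python
import pandas as pd

# Frames rebuilt from the values in the question.
test = pd.DataFrame({
    "key1": [0.365803, 0.086869, 0.269619, 0.755826, 0.204009],
    "key2": [0.259112, 0.589834, 0.183644, 0.045187, 0.669371],
})
data1 = pd.DataFrame({"key1": [0.365803], "key2": [0.259112], "new feature": ["info1"]})
data2 = pd.DataFrame({"key1": [0.204009], "key2": [0.669371], "new feature": ["info2"]})

# Hypothetical container for all new frames; add more entries as needed.
new_frames = [data1, data2]

# Stack the new rows first, so "new feature" exists only once on the right side.
dat = pd.concat(new_frames, ignore_index=True)

# A single left merge keeps every row of `test` and fills unmatched rows with NaN.
# validate="m:1" (optional) checks that each key pair is unique in `dat`.
df = test.merge(dat, on=["key1", "key2"], how="left", validate="m:1")

print(df)
```

Because there is only one merge, no suffixed columns are created, regardless of how many new frames are in the list.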

You can use pd.DataFrame.update instead:

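# create the new column and set the keys as index
# (the column name must match the feature column in data1 and data2; here it is assumed to be 'newfeature')
res = test.assign(newfeature=None).set_index(['key1', 'key2'])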

# update with new data sequentially
res.update(data1.set_index(['key1', 'key2']))
res.update(data2.set_index(['key1', 'key2']))

# reset index to recover columns
res = res.reset_index()

print(res)

       key1      key2 newfeature
0  0.365803  0.259112      info1
1  0.086869  0.589834       None
2  0.269619  0.183644       None
3  0.755826  0.045187       None
4  0.204009  0.669371      info2
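
As a rough end-to-end sketch of the update approach, assuming the frames from the question, where the column is called "new feature" with a space (so it is created through dictionary unpacking rather than a keyword argument). Note that update modifies the frame in place and returns None, and it only touches columns that exist in both frames:

```python
import pandas as pd

# Frames rebuilt from the values in the question.
test = pd.DataFrame({
    "key1": [0.365803, 0.086869, 0.269619, 0.755826, 0.204009],
    "key2": [0.259112, 0.589834, 0.183644, 0.045187, 0.669371],
})
data1 = pd.DataFrame({"key1": [0.365803], "key2": [0.259112], "new feature": ["info1"]})
data2 = pd.DataFrame({"key1": [0.204009], "key2": [0.669371], "new feature": ["info2"]})

keys = ["key1", "key2"]

# Create an empty "new feature" column, then index by the key columns
# so that update can align rows on (key1, key2).
res = test.assign(**{"new feature": None}).set_index(keys)

# Apply each new frame in turn; update works in place and only overwrites
# cells for which the new frame has a non-missing value.
for new in (data1, data2):
    res.update(new.set_index(keys))

# Turn the keys back into ordinary columns.
res = res.reset_index()

print(res)
```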

You can also give both DataFrames the same index (df is the main frame, df2 holds the new data) and assign with loc, which aligns the values on that index:

df  = df.set_index(["key1", "key2"])
df2 = df2.set_index(["key1", "key2"])

Then assign the column, using the same column name as in the new data:

df.loc[:, "new feature"] = df2["new feature"]
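
A minimal sketch of this index-aligned assignment, assuming the frames from the question. Two details here are assumptions rather than part of the answer: the new frames are stacked with concat first so that a single assignment covers both, and plain column assignment is used, which aligns on the index in the same way.

```python
import pandas as pd

# Frames rebuilt from the values in the question.
test = pd.DataFrame({
    "key1": [0.365803, 0.086869, 0.269619, 0.755826, 0.204009],
    "key2": [0.259112, 0.589834, 0.183644, 0.045187, 0.669371],
})
data1 = pd.DataFrame({"key1": [0.365803], "key2": [0.259112], "new feature": ["info1"]})
data2 = pd.DataFrame({"key1": [0.204009], "key2": [0.669371], "new feature": ["info2"]})

keys = ["key1", "key2"]

# Index both sides by the key columns.
df = test.set_index(keys)
df2 = pd.concat([data1, data2], ignore_index=True).set_index(keys)

# Assigning a Series aligns it on the index: matching key pairs get the value,
# all other rows of df get NaN. The key pairs in df2 must be unique for this to work.
df["new feature"] = df2["new feature"]

# Restore key1 and key2 as ordinary columns.
result = df.reset_index()

print(result)
```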
