Match two separate dataframes to a larger dataframe based on matching values (In Python)

Question

I have one large dataframe and two smaller dataframes that I would like to append/match to it based on certain criteria.

Data

df1 (large dataframe)

id  Date    pp  pos
aa  q122    200 10
aa  q222    200 10
bb  q322    500 5
bb  q422    500 5
cc  q122    100 2
cc  q222    100 2

df2

name    date1   count1  pp1
aa      q122    3       30
aa      q222    5       10

df3

ex  date2   count2  pp2
cc  q122    3       30
cc  q222    5       10

Desired

id  Date    pp  pos name    date1   count1  pwr1    ex  date2   count2  pwr2
aa  q122    200 10  aa      q122    3       30      NaN NaN     0       0
aa  q222    200 10  aa      q222    5       10      NaN NaN     0       0
bb  q322    500 5   NaN     NaN     0       0       NaN NaN     0       0
bb  q422    500 5   NaN     NaN     0       0       NaN NaN     0       0
cc  q122    100 2   NaN     NaN     0       0       cc  q122    3       30
cc  q222    100 2   NaN     NaN     0       0       cc  q222    5       10

Logic: I am matching each smaller dataframe to df1 where its 'name' or 'ex' value matches the 'id' value and its date matches 'Date'.

Doing

df1['id'] = df1['name'].combine_first(df1['ex'])


out = df2.merge(df1, on=['id', 'date'], how='outer')

But I am getting a little lost on how to incorporate the third dataframe. Any suggestion is appreciated.

Henry Ecker · Accepted Answer · 2021-08-20 23:35:46Z

2

We can chain merge operations:

out = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    )
)

out:

   id  Date   pp  pos name date1  count1   pp1   ex date2  count2   pp2
0  aa  q122  200   10   aa  q122     3.0  30.0  NaN   NaN     NaN   NaN
1  aa  q222  200   10   aa  q222     5.0  10.0  NaN   NaN     NaN   NaN
2  bb  q322  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
3  bb  q422  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
4  cc  q122  100    2  NaN   NaN     NaN   NaN   cc  q122     3.0  30.0
5  cc  q222  100    2  NaN   NaN     NaN   NaN   cc  q222     5.0  10.0
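One thing worth checking with the chained merge above: how='outer' also keeps rows from df2 or df3 that have no match in df1. If df1 is meant to be the base table, how='left' may be closer to the intent. The sketch below assumes a hypothetical extra row ('zz') in df2 that is not in the question's data, added only to make the difference visible; validate='many_to_one' is an optional check that the right-hand keys are unique.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2]
})
# Hypothetical: df2 with one extra row ('zz') that has no partner in df1
df2_extra = pd.DataFrame({
    'name': ['aa', 'aa', 'zz'],
    'date1': ['q122', 'q222', 'q122'],
    'count1': [3, 5, 9],
    'pp1': [30, 10, 90]
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10]
})


def chain(how):
    # Same chained merge as above; only the join type changes
    return df1.merge(
        df2_extra, left_on=['id', 'Date'], right_on=['name', 'date1'],
        how=how, validate='many_to_one'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'],
        how=how, validate='many_to_one'
    )


outer_out = chain('outer')  # keeps the unmatched 'zz' row, with id/Date as NaN
left_out = chain('left')    # keeps exactly the rows of df1

print(len(outer_out), len(left_out))
```

With the question's original data both join types give the same six rows, so the choice only matters when the smaller frames can contain keys missing from df1.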

Add DataFrame.rename and DataFrame.fillna to match the exact output: rename the columns, fill certain columns with 0, and downcast to int where possible:

out = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    ).rename(
        columns={'pp1': 'pwr1', 'pp2': 'pwr2'}
    ).fillna(
        {'count1': 0, 'pwr1': 0, 'count2': 0, 'pwr2': 0}, downcast='infer'
    )
)

out:

   id  Date   pp  pos name date1  count1  pwr1   ex date2  count2  pwr2
0  aa  q122  200   10   aa  q122       3    30  NaN   NaN       0     0
1  aa  q222  200   10   aa  q222       5    10  NaN   NaN       0     0
2  bb  q322  500    5  NaN   NaN       0     0  NaN   NaN       0     0
3  bb  q422  500    5  NaN   NaN       0     0  NaN   NaN       0     0
4  cc  q122  100    2  NaN   NaN       0     0   cc  q122       3    30
5  cc  q222  100    2  NaN   NaN       0     0   cc  q222       5    10
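A note for newer pandas versions: as far as I know, the downcast argument of fillna is deprecated in recent releases, so the call above may emit a warning. A minimal sketch of an alternative, assuming the four filled columns really do hold whole numbers, is to fill and then cast explicitly with astype. The name fill_cols is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2]
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10]
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10]
})

out = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    ).rename(
        columns={'pp1': 'pwr1', 'pp2': 'pwr2'}
    )
)

# Fill only the numeric columns, then cast them back to integers explicitly
fill_cols = ['count1', 'pwr1', 'count2', 'pwr2']
out[fill_cols] = out[fill_cols].fillna(0).astype('int64')

print(out.dtypes)
```

The explicit cast raises an error if a column contains a non-integer value, which is arguably safer than a silent inference.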

DataFrames and imports:

import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2]
})

df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10]
})

df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10]
})

edited Aug 20, 2021 at 23:35

answered Aug 20, 2021 at 23:31

Henry Ecker♦


3 Comments

Lynn Over a year ago

Thank you, let me try this suggestion. What is the downcast infer? Is this just filling the remaining columns with values?

Henry Ecker Over a year ago

NaN is of type float (and a column must be of a single type). If you just fillna, you'll end up with 0.0 (float). Since the column contains whole numbers, downcast='infer' says to reduce the values (if possible) to the smallest supported type, which gives you int in this case. If your actual data is of type float you can skip it. @Lynn

Lynn Over a year ago

OK, this makes sense. Thank you for your knowledge.
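The dtype behaviour discussed in these comments can be checked on a small Series. This is only an illustrative sketch; the Series and its labels are made up and are not part of the question's data.

```python
import pandas as pd

counts = pd.Series([3, 5], index=['a', 'b'])  # integer column
aligned = counts.reindex(['a', 'b', 'c'])     # 'c' is missing, so NaN appears
filled = aligned.fillna(0)                    # NaN replaced, but still float
as_int = filled.astype('int64')               # explicit cast back to integer

print(counts.dtype, aligned.dtype, filled.dtype, as_int.dtype)
```

The same upcast happens to count1, pp1, count2 and pp2 after the merge, which is why they show up as 3.0 and 30.0 before any filling.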

not_speshal · Accepted Answer · 2021-08-20 23:31:11Z

Do it in two stages:

merged_1 = pd.merge(df1, df2, left_on=["id", "Date"], right_on=["name", "date1"], how="outer")
merged = pd.merge(merged_1, df3, left_on=["id", "Date"], right_on=["ex", "date2"], how="outer")

>>> merged
   id  Date   pp  pos name date1  count1   pp1   ex date2  count2   pp2
0  aa  q122  200   10   aa  q122     3.0  30.0  NaN   NaN     NaN   NaN
1  aa  q222  200   10   aa  q222     5.0  10.0  NaN   NaN     NaN   NaN
2  bb  q322  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
3  bb  q422  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
4  cc  q122  100    2  NaN   NaN     NaN   NaN   cc  q122     3.0  30.0
5  cc  q222  100    2  NaN   NaN     NaN   NaN   cc  q222     5.0  10.0

wwnde · Accepted Answer · 2021-08-20 23:32:02Z

1

Do you need to show multiple object ids?

If not, I would use join. Code below:

df1.set_index('id').join(df2.set_index('name')).join(df3.set_index('ex'))
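One caveat with the join above: it matches on id only, so with the sample data each df1 row is paired with every df2 or df3 row that shares its id, regardless of date. A possible variant, sketched here under the assumption that the key columns can be renamed to a common pair of names (the names df2_keyed and df3_keyed are only illustrative), joins on both id and date instead.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2]
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10]
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10]
})

# Joining on id alone pairs each df1 row with every same-id row of df2/df3
by_id_only = (
    df1.set_index('id').join(df2.set_index('name')).join(df3.set_index('ex'))
)

# Joining on (id, Date): give all three frames the same two index levels first
keys = ['id', 'Date']
df2_keyed = df2.rename(columns={'name': 'id', 'date1': 'Date'}).set_index(keys)
df3_keyed = df3.rename(columns={'ex': 'id', 'date2': 'Date'}).set_index(keys)
by_id_and_date = (
    df1.set_index(keys).join(df2_keyed).join(df3_keyed).reset_index()
)

print(len(by_id_only), len(by_id_and_date))
```

Note that this variant does not carry separate name/date1/ex/date2 columns into the result, since they are folded into the shared id and Date keys.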

answered Aug 20, 2021 at 23:32

wwnde
