I have one large dataframe and two smaller dataframes that I would like to append/match to it based on certain criteria.

Data

df1 (large dataframe)

id  Date    pp  pos
aa  q122    200 10
aa  q222    200 10
bb  q322    500 5
bb  q422    500 5
cc  q122    100 2
cc  q222    100 2

df2

name    date1   count1  pp1
aa      q122    3       30
aa      q222    5       10

df3

ex  date2   count2  pp2
cc  q122    3       30
cc  q222    5       10

Desired

id  Date    pp  pos name    date1   count1  pwr1    ex  date2   count2  pwr2
aa  q122    200 10  aa      q122    3       30      NaN NaN     0       0
aa  q222    200 10  aa      q222    5       10      NaN NaN     0       0
bb  q322    500 5   NaN     NaN     0       0       NaN NaN     0       0
bb  q422    500 5   NaN     NaN     0       0       NaN NaN     0       0
cc  q122    100 2   NaN     NaN     0       0       cc  q122    3       30
cc  q222    100 2   NaN     NaN     0       0       cc  q222    5       10
                                
                                

Logic: I am matching the individual dataframes on whether the 'name' and 'ex' values match the 'id' value, and on whether the dates match as well.

Doing

df1['id'] = df1['name'].combine_first(df1['ex'])


out = df2.merge(df1, on=['id', 'date'], how='outer')

But I am getting a little lost on how to incorporate the 3rd dataframe. Any suggestion is appreciated.

3 Answers

We can chain merge operations:

out = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    )
)

out:

   id  Date   pp  pos name date1  count1   pp1   ex date2  count2   pp2
0  aa  q122  200   10   aa  q122     3.0  30.0  NaN   NaN     NaN   NaN
1  aa  q222  200   10   aa  q222     5.0  10.0  NaN   NaN     NaN   NaN
2  bb  q322  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
3  bb  q422  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
4  cc  q122  100    2  NaN   NaN     NaN   NaN   cc  q122     3.0  30.0
5  cc  q222  100    2  NaN   NaN     NaN   NaN   cc  q222     5.0  10.0
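
Since the desired output contains only the rows of df1, a left merge may be closer to the intent than an outer merge, which would also keep any df2/df3 rows that have no partner in df1. The following is a minimal sketch of that variant, assuming each (name, date1) and (ex, date2) pair occurs at most once; the validate argument makes pandas raise if that assumption does not hold. The name out_left is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2],
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10],
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10],
})

# how='left' keeps exactly the rows of df1;
# validate='many_to_one' raises MergeError if the right-hand keys repeat.
out_left = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'],
        how='left', validate='many_to_one',
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'],
        how='left', validate='many_to_one',
    )
)
print(out_left)
```

On the sample data both variants return the same six rows; they differ only when df2 or df3 contains keys that are missing from df1.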

To match the desired output exactly, DataFrame.rename changes the column names and DataFrame.fillna fills certain columns with 0, downcasting to int where possible:

out = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    ).rename(
        columns={'pp1': 'pwr1', 'pp2': 'pwr2'}
    ).fillna(
        {'count1': 0, 'pwr1': 0, 'count2': 0, 'pwr2': 0}, downcast='infer'
    )
)

out:

   id  Date   pp  pos name date1  count1  pwr1   ex date2  count2  pwr2
0  aa  q122  200   10   aa  q122       3    30  NaN   NaN       0     0
1  aa  q222  200   10   aa  q222       5    10  NaN   NaN       0     0
2  bb  q322  500    5  NaN   NaN       0     0  NaN   NaN       0     0
3  bb  q422  500    5  NaN   NaN       0     0  NaN   NaN       0     0
4  cc  q122  100    2  NaN   NaN       0     0   cc  q122       3    30
5  cc  q222  100    2  NaN   NaN       0     0   cc  q222       5    10
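
Newer pandas releases deprecate the downcast argument of fillna (to my knowledge since version 2.2), so the call above may emit a warning or fail depending on the installed version. A minimal sketch of an alternative that avoids it, assuming the filled columns hold whole numbers only: fill with 0 and convert explicitly with astype. The names merged and fill_cols are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2],
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10],
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10],
})

merged = (
    df1.merge(
        df2, left_on=['id', 'Date'], right_on=['name', 'date1'], how='outer'
    ).merge(
        df3, left_on=['id', 'Date'], right_on=['ex', 'date2'], how='outer'
    ).rename(
        columns={'pp1': 'pwr1', 'pp2': 'pwr2'}
    )
)

# The unmatched rows hold NaN, which forces these columns to float.
# Fill the gaps with 0, then cast back to int explicitly.
fill_cols = ['count1', 'pwr1', 'count2', 'pwr2']
merged[fill_cols] = merged[fill_cols].fillna(0).astype(int)
print(merged.dtypes)
```

The explicit cast raises if a column contains fractional or remaining missing values, so it is only appropriate when the data really is integer-valued.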

DataFrames and imports:

import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2]
})

df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10]
})

df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10]
})

3 Comments

Thank you, let me try this suggestion. What is the downcast='infer'? Is this just filling the remaining columns with values?
NaN is of type float (and a column must be of a single type). If you just fillna, you'll end up with 0.0 (float). Since the column contains whole numbers, downcast='infer' says to reduce to the smallest supported type if possible, which gives int in this case. If your actual data is of type float, you can skip it. @Lynn
OK, this makes sense. Thank you for your knowledge.

Do it in two stages:

merged_1 = pd.merge(df1, df2, left_on=["id", "Date"], right_on=["name", "date1"], how="outer")
merged = pd.merge(merged_1, df3, left_on=["id", "Date"], right_on=["ex", "date2"], how="outer")

>>> merged
   id  Date   pp  pos name date1  count1   pp1   ex date2  count2   pp2
0  aa  q122  200   10   aa  q122     3.0  30.0  NaN   NaN     NaN   NaN
1  aa  q222  200   10   aa  q222     5.0  10.0  NaN   NaN     NaN   NaN
2  bb  q322  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
3  bb  q422  500    5  NaN   NaN     NaN   NaN  NaN   NaN     NaN   NaN
4  cc  q122  100    2  NaN   NaN     NaN   NaN   cc  q122     3.0  30.0
5  cc  q222  100    2  NaN   NaN     NaN   NaN   cc  q222     5.0  10.0
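
If more lookup frames are added later, the same two-stage pattern can be written as a loop. This is a sketch under the assumption that every extra frame is matched against df1's id and Date columns; the list name lookups and the variable merged_all are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2],
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10],
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10],
})

# Each entry pairs a frame with the columns that correspond to id and Date.
lookups = [
    (df2, ['name', 'date1']),
    (df3, ['ex', 'date2']),
]

merged_all = df1
for frame, key_cols in lookups:
    # Merge one lookup frame at a time onto the running result.
    merged_all = merged_all.merge(
        frame, left_on=['id', 'Date'], right_on=key_cols, how='outer'
    )
print(merged_all)
```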

Does the output need to show the multiple id columns (name, ex)?

If not, I would use join:

df1.set_index('id').join(df2.set_index('name')).join(df3.set_index('ex'))
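
One caveat: joining on id alone pairs every df1 row with every df2 (or df3) row that shares the id, regardless of date, so rows get repeated. A minimal sketch that also matches on the date, assuming it is acceptable to rename the key columns of df2 and df3 to id and Date first so that the index names line up; the names id_only and joined are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc'],
    'Date': ['q122', 'q222', 'q322', 'q422', 'q122', 'q222'],
    'pp': [200, 200, 500, 500, 100, 100],
    'pos': [10, 10, 5, 5, 2, 2],
})
df2 = pd.DataFrame({
    'name': ['aa', 'aa'],
    'date1': ['q122', 'q222'],
    'count1': [3, 5],
    'pp1': [30, 10],
})
df3 = pd.DataFrame({
    'ex': ['cc', 'cc'],
    'date2': ['q122', 'q222'],
    'count2': [3, 5],
    'pp2': [30, 10],
})

# Joining on id only: each id matches several rows, so rows multiply.
id_only = df1.set_index('id').join(df2.set_index('name')).join(df3.set_index('ex'))

# Joining on (id, Date): rename the key columns so all indexes share names.
keys = ['id', 'Date']
joined = (
    df1.set_index(keys)
    .join(df2.rename(columns={'name': 'id', 'date1': 'Date'}).set_index(keys))
    .join(df3.rename(columns={'ex': 'id', 'date2': 'Date'}).set_index(keys))
    .reset_index()
)
print(len(id_only), len(joined))
```

The keyed version returns one row per df1 row, but it does not carry separate name, date1, ex, and date2 columns, since those become the shared index.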
