How to insert missing data into a pandas dataframe, using another dataframe that has the missing info?

Question

Let's say I have a dataframe of leads as such:

import pandas as pd

leads = {'Unique Identifier':['1','2','3','4','5','6','7','8'],
        'Name': ['brad','stacy','holly','mike','phil', 'chris','jane','glenn'],
        'Channel': [None,None,None,None,'facebook', 'facebook','google', 'facebook'],
        'Campaign': [None,None,None,None,'A', 'B','B', 'C'],
        'Gender': ['M','F','F','M','M', 'M','F','M'],
        'Signup Month':['Mar','Mar','Apr','May','May','May','Jun','Jun']
        }

leads_df = pd.DataFrame(leads)

leads_df

which looks like the following. It has missing data for Channel and Campaign for the first 4 leads.

[Image: leads table]

I have a separate dataframe with the missing data:

missing = {'Unique Identifier':['1','2','3','4'],
        'Channel': ['google', 'email','facebook', 'google'],
        'Campaign': ['B', 'A','C', 'B']
        }

missing_df = pd.DataFrame(missing)

missing_df

[Image: table with missing data]

Using the Unique Identifiers in both tables, how would I go about plugging in the missing data into the main leads table? For context there are about 6,000 leads with missing data.

PacketLoss · Accepted Answer · 2020-07-22 23:32:57Z

You can merge the two dataframes together, update the columns using the results from the merge and then proceed to drop the merged columns.

data = leads_df.merge(missing_df, how='outer', on='Unique Identifier')
data['Channel'] = data['Channel_y'].fillna(data['Channel_x'])
data['Campaign'] = data['Campaign_y'].fillna(data['Campaign_x'])
# Pass columns= explicitly; the positional axis argument no longer works in pandas 2.0+
data.drop(columns=['Channel_x', 'Channel_y', 'Campaign_x', 'Campaign_y'], inplace=True)

The result:

data
  Unique Identifier   Name Gender Signup Month   Channel Campaign
0                 1   brad      M          Mar    google        B
1                 2  stacy      F          Mar     email        A
2                 3  holly      F          Apr  facebook        C
3                 4   mike      M          May    google        B
4                 5   phil      M          May  facebook        A
5                 6  chris      M          May  facebook        B
6                 7   jane      F          Jun    google        B
7                 8  glenn      M          Jun  facebook        C
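A variation on the merge above, offered as a sketch rather than a replacement: a left join keeps exactly the rows of leads_df, and explicit suffixes avoid the _x/_y names. The `_fill` suffix and the variable names `fill_cols`, `merged` and `filled` are made up for this illustration, and the data is a shortened copy of the question's. Unlike the answer's code, this version keeps any value already present in leads_df and only falls back to missing_df where it is null. It assumes each identifier appears at most once in missing_df.

```python
import pandas as pd

leads_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4", "5"],
    "Name": ["brad", "stacy", "holly", "mike", "phil"],
    "Channel": [None, None, None, None, "facebook"],
    "Campaign": [None, None, None, None, "A"],
})
missing_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4"],
    "Channel": ["google", "email", "facebook", "google"],
    "Campaign": ["B", "A", "C", "B"],
})

fill_cols = ["Channel", "Campaign"]  # hypothetical name for the columns to patch

# Left join: one output row per lead; columns coming from missing_df
# that clash with leads_df get the "_fill" suffix.
merged = leads_df.merge(
    missing_df, how="left", on="Unique Identifier", suffixes=("", "_fill")
)

for col in fill_cols:
    # Keep the existing value; use the looked-up value only where it is null.
    merged[col] = merged[col].fillna(merged[f"{col}_fill"])

# Drop the helper columns once the values have been copied over.
filled = merged.drop(columns=[f"{col}_fill" for col in fill_cols])
print(filled)
```

Looping over `fill_cols` keeps the code the same length no matter how many columns need patching.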

sammywemmy · Accepted Answer · 2020-07-23 00:02:56Z

You can set the index of both dataframes to the unique identifier and use combine_first to fill the null values in leads_df:

(leads_df
.set_index("Unique Identifier")
.combine_first(missing_df.set_index("Unique Identifier"))
.reset_index()
)
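For reference, here is a self-contained sketch of the same combine_first idea on a shortened copy of the question's data. The names `key` and `combined` are chosen for this illustration. Because the result is built from the union of both frames' labels, columns and rows may come back in a different order than in leads_df, so the sketch restores the original column order at the end.

```python
import pandas as pd

leads_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4", "5"],
    "Name": ["brad", "stacy", "holly", "mike", "phil"],
    "Channel": [None, None, None, None, "facebook"],
    "Campaign": [None, None, None, None, "A"],
    "Gender": ["M", "F", "F", "M", "M"],
})
missing_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4"],
    "Channel": ["google", "email", "facebook", "google"],
    "Campaign": ["B", "A", "C", "B"],
})

key = "Unique Identifier"

# With the identifier as the index, combine_first matches rows by identifier:
# values from leads_df win, and nulls are filled from missing_df.
combined = (
    leads_df.set_index(key)
    .combine_first(missing_df.set_index(key))
    .reset_index()
)

# Put the columns back in the order they had in leads_df.
combined = combined[leads_df.columns]
print(combined)
```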


sangmoo · Accepted Answer · 2020-07-23 01:59:37Z

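My approach in this kind of case is similar to Excel's VLOOKUP function.

mask = leads_df['Channel'].isna()
# .to_numpy() assigns by position; the merge result has a fresh 0..n-1 index
# that would otherwise be aligned against the index of leads_df
leads_df.loc[mask, 'Channel'] = pd.merge(leads_df.loc[mask, 'Unique Identifier'],
                                         missing_df,
                                         how='left',
                                         on='Unique Identifier')['Channel'].to_numpy()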

This code results in:

  Unique Identifier   Name   Channel Campaign Gender Signup Month
0                 1   brad    google     None      M          Mar
1                 2  stacy     email     None      F          Mar
2                 3  holly  facebook     None      F          Apr
3                 4   mike    google     None      M          May
4                 5   phil  facebook        A      M          May
5                 6  chris  facebook        B      M          May
6                 7   jane    google        B      F          Jun
7                 8  glenn  facebook        C      M          Jun

You can do the same for 'Campaign'.
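Another way to get a VLOOKUP-style fill for both columns is Series.map, sketched below. This is a swapped-in technique, not the code from the answer, and the names `lookup` and `looked_up` are made up for the illustration. The sample rows are deliberately out of order to show that values are matched by identifier rather than by position. It assumes identifiers are unique in missing_df.

```python
import pandas as pd

# Rows with missing data are interleaved with complete rows on purpose.
leads_df = pd.DataFrame({
    "Unique Identifier": ["5", "1", "6", "2"],
    "Name": ["phil", "brad", "chris", "stacy"],
    "Channel": ["facebook", None, "facebook", None],
    "Campaign": ["A", None, "B", None],
})
missing_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4"],
    "Channel": ["google", "email", "facebook", "google"],
    "Campaign": ["B", "A", "C", "B"],
})

# Lookup table indexed by identifier (assumed unique).
lookup = missing_df.set_index("Unique Identifier")

for col in ["Channel", "Campaign"]:
    # Translate each lead's identifier into the value stored in the lookup;
    # identifiers absent from the lookup become NaN.
    looked_up = leads_df["Unique Identifier"].map(lookup[col])
    # Only fill where the lead has no value yet.
    leads_df[col] = leads_df[col].fillna(looked_up)

print(leads_df)
```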

Sandeep Kothari · Accepted Answer · 2020-07-23 02:06:13Z

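You just need to fill it using fillna(). Note that passing a dataframe to fillna() matches rows by index, not by 'Unique Identifier', so this relies on the two frames' indexes lining up as they do in the example: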

leads_df.fillna(missing_df, inplace=True)
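When the row indexes of the two frames do not line up, one option is to index both by the identifier before calling fillna. The sketch below uses made-up sample rows in a shuffled order, and the names `key` and `filled` are chosen for this illustration.

```python
import pandas as pd

# The default 0..3 index here does not correspond to the rows of missing_df.
leads_df = pd.DataFrame({
    "Unique Identifier": ["5", "1", "6", "2"],
    "Name": ["phil", "brad", "chris", "stacy"],
    "Channel": ["facebook", None, "facebook", None],
    "Campaign": ["A", None, "B", None],
})
missing_df = pd.DataFrame({
    "Unique Identifier": ["1", "2", "3", "4"],
    "Channel": ["google", "email", "facebook", "google"],
    "Campaign": ["B", "A", "C", "B"],
})

key = "Unique Identifier"

# fillna with a DataFrame aligns on index and column labels, so indexing both
# frames by the identifier makes the fill match on the identifier.
filled = (
    leads_df.set_index(key)
    .fillna(missing_df.set_index(key))
    .reset_index()
)
print(filled)
```

The result is a new dataframe; leads_df itself is left unchanged.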


Meow · Accepted Answer · 2020-07-23 02:21:17Z

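There is a pandas DataFrame method for this called combine_first. Called on the two frames as they are, it matches rows by index rather than by 'Unique Identifier', which works for the example data because the first four rows of both frames share the same index: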

voltron = leads_df.combine_first(missing_df)

edited Jul 23, 2020 at 2:21

answered Jul 23, 2020 at 2:14
