0

Say I have a DataFrame of leads like this:

import pandas as pd

leads = {'Unique Identifier':['1','2','3','4','5','6','7','8'],
        'Name': ['brad','stacy','holly','mike','phil', 'chris','jane','glenn'],
        'Channel': [None,None,None,None,'facebook', 'facebook','google', 'facebook'],
        'Campaign': [None,None,None,None,'A', 'B','B', 'C'],
        'Gender': ['M','F','F','M','M', 'M','F','M'],
        'Signup Month':['Mar','Mar','Apr','May','May','May','Jun','Jun']
        }

leads_df = pd.DataFrame(leads)

leads_df

The DataFrame looks like the following. Channel and Campaign are missing for the first 4 leads.

[image: leads table]

I have a separate dataframe with the missing data:

missing = {'Unique Identifier':['1','2','3','4'],
        'Channel': ['google', 'email','facebook', 'google'],
        'Campaign': ['B', 'A','C', 'B']
        }

missing_df = pd.DataFrame(missing)

missing_df

[image: table with missing data]

Using the Unique Identifier column present in both tables, how can I fill the missing data into the main leads table? For context, there are about 6,000 leads with missing data.

5 Answers

1

You can merge the two dataframes together, update the columns using the results from the merge and then proceed to drop the merged columns.

data = leads_df.merge(missing_df, how='outer', on='Unique Identifier')
data['Channel'] = data['Channel_y'].fillna(data['Channel_x'])
data['Campaign'] = data['Campaign_y'].fillna(data['Campaign_x'])
# Pass the labels via columns=; the positional axis argument was removed in pandas 2.0
data.drop(columns=['Channel_x', 'Channel_y', 'Campaign_x', 'Campaign_y'], inplace=True)

The result:

data
  Unique Identifier   Name Gender Signup Month   Channel Campaign
0                 1   brad      M          Mar    google        B
1                 2  stacy      F          Mar     email        A
2                 3  holly      F          Apr  facebook        C
3                 4   mike      M          May    google        B
4                 5   phil      M          May  facebook        A
5                 6  chris      M          May  facebook        B
6                 7   jane      F          Jun    google        B
7                 8  glenn      M          Jun  facebook        C
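
The merge approach can be generalised so the fill columns are not hard-coded. The following is a minimal sketch, not the answer's exact code: it assumes a left merge (so only rows of leads_df are kept, in their original order) and it keeps existing values, using the looked-up value only where the original is missing. The names key, fill_cols, merged and filled_df and the '_fill' suffix are illustrative, and only a subset of the question's columns is included for brevity.

```python
import pandas as pd

leads_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Name': ['brad', 'stacy', 'holly', 'mike', 'phil', 'chris', 'jane', 'glenn'],
    'Channel': [None, None, None, None, 'facebook', 'facebook', 'google', 'facebook'],
    'Campaign': [None, None, None, None, 'A', 'B', 'B', 'C'],
})
missing_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4'],
    'Channel': ['google', 'email', 'facebook', 'google'],
    'Campaign': ['B', 'A', 'C', 'B'],
})

key = 'Unique Identifier'
# Every non-key column of missing_df is treated as a column to fill.
fill_cols = [c for c in missing_df.columns if c != key]

# Left merge: one output row per lead; looked-up columns get the '_fill' suffix.
merged = leads_df.merge(missing_df, how='left', on=key, suffixes=('', '_fill'))

for col in fill_cols:
    # Keep the existing value and fall back to the looked-up one where it is missing.
    merged[col] = merged[col].fillna(merged[col + '_fill'])

# Drop the helper columns created by the merge.
filled_df = merged.drop(columns=[c + '_fill' for c in fill_cols])
```

This assumes each identifier appears at most once in missing_df; duplicate identifiers there would duplicate the matching lead rows in the merge.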

0

You can set the index of both dataframes to Unique Identifier and use combine_first to fill the null values in leads_df:

(leads_df
.set_index("Unique Identifier")
.combine_first(missing_df.set_index("Unique Identifier"))
.reset_index()
)
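
A self-contained version of this approach might look like the sketch below. Because combine_first builds the union of the two frames' columns and index labels, the output ordering may differ from leads_df, so the sketch restores the original column order explicitly. The names key and combined are illustrative, and only a subset of the question's columns is used.

```python
import pandas as pd

leads_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Name': ['brad', 'stacy', 'holly', 'mike', 'phil', 'chris', 'jane', 'glenn'],
    'Channel': [None, None, None, None, 'facebook', 'facebook', 'google', 'facebook'],
    'Campaign': [None, None, None, None, 'A', 'B', 'B', 'C'],
})
missing_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4'],
    'Channel': ['google', 'email', 'facebook', 'google'],
    'Campaign': ['B', 'A', 'C', 'B'],
})

key = 'Unique Identifier'

combined = (
    leads_df.set_index(key)                     # align rows on the identifier
    .combine_first(missing_df.set_index(key))   # fill nulls from missing_df
    .reset_index()                              # turn the identifier back into a column
)

# Restore the column order of the original leads table.
combined = combined[leads_df.columns]
```

Row order can also change if the identifiers are not already sorted, so look rows up by identifier rather than by position when checking the result.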


0

The approach I use in this kind of case is similar to Excel's VLOOKUP function.

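mask = leads_df['Channel'].isna()
# The merge result has a fresh 0..n-1 index, so assign by position with .to_numpy()
leads_df.loc[mask, 'Channel'] = pd.merge(leads_df.loc[mask, ['Unique Identifier']],
                                         missing_df,
                                         on='Unique Identifier',
                                         how='left')['Channel'].to_numpy()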

This results in:

  Unique Identifier   Name   Channel Campaign Gender Signup Month
0                 1   brad    google     None      M          Mar
1                 2  stacy     email     None      F          Mar
2                 3  holly  facebook     None      F          Apr
3                 4   mike    google     None      M          May
4                 5   phil  facebook        A      M          May
5                 6  chris  facebook        B      M          May
6                 7   jane    google        B      F          Jun
7                 8  glenn  facebook        C      M          Jun

You can do the same for 'Campaign'.
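
The same lookup idea can cover both columns in a loop. The sketch below swaps the merge for Series.map, which looks each identifier up in a Series indexed by Unique Identifier and keeps the original row index, so the assignment lines up without extra work. It assumes identifiers are unique in missing_df; the names lookup and mask are illustrative, and only a subset of the question's columns is used.

```python
import pandas as pd

leads_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Name': ['brad', 'stacy', 'holly', 'mike', 'phil', 'chris', 'jane', 'glenn'],
    'Channel': [None, None, None, None, 'facebook', 'facebook', 'google', 'facebook'],
    'Campaign': [None, None, None, None, 'A', 'B', 'B', 'C'],
})
missing_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4'],
    'Channel': ['google', 'email', 'facebook', 'google'],
    'Campaign': ['B', 'A', 'C', 'B'],
})

# Lookup table keyed by identifier (assumes identifiers are unique here).
lookup = missing_df.set_index('Unique Identifier')

for col in ['Channel', 'Campaign']:
    mask = leads_df[col].isna()
    # map() returns a Series with the same index as the masked rows,
    # so .loc assigns each value to the correct lead.
    leads_df.loc[mask, col] = leads_df.loc[mask, 'Unique Identifier'].map(lookup[col])
```

Leads whose identifier is absent from missing_df simply stay null.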


0

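You can fill it using fillna(). When given a DataFrame, fillna aligns on the row index and column labels, so set Unique Identifier as the index of both frames first; with the default index, rows would be matched by row number rather than by identifier.

leads_df = (leads_df.set_index('Unique Identifier')
            .fillna(missing_df.set_index('Unique Identifier'))
            .reset_index())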


0

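There is a pandas DataFrame method for this called combine_first. It aligns rows on the index, not on Unique Identifier, so the call below is correct only when each lead sits at the same index label in both frames, as happens here; otherwise set Unique Identifier as the index of both frames first: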

voltron = leads_df.combine_first(missing_df)
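
To see why the index matters for combine_first, here is a small sketch in which the lookup rows are listed in reverse order. The reordered missing_df and the names by_position and by_key are illustrative assumptions, not part of the question's data layout.

```python
import pandas as pd

leads_df = pd.DataFrame({
    'Unique Identifier': ['1', '2', '3', '4', '5'],
    'Channel': [None, None, None, None, 'facebook'],
})
# Same channels per lead as in the question, but rows listed in reverse (illustrative).
missing_df = pd.DataFrame({
    'Unique Identifier': ['4', '3', '2', '1'],
    'Channel': ['google', 'facebook', 'email', 'google'],
})

# Aligned on the default 0..n-1 index: the row at label 1 (lead '2')
# is filled from missing_df's row at label 1, which belongs to lead '3'.
by_position = leads_df.combine_first(missing_df)

# Aligned on the identifier: each lead receives its own channel.
by_key = (
    leads_df.set_index('Unique Identifier')
    .combine_first(missing_df.set_index('Unique Identifier'))
    .reset_index()
)
```

In by_position, lead '2' ends up with 'facebook', while by_key gives it 'email' as intended.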
