Replace multiple columns' NaNs with other columns' values in Pandas

Question

Given a dataframe as follows:

      date city  gdp  gdp1  gdp2  gross domestic product  pop  pop1  pop2
0  2001-03   bj  3.0   NaN   NaN                     NaN  7.0   NaN   NaN
1  2001-06   bj  5.0   NaN   NaN                     NaN  6.0   6.0   NaN
2  2001-09   bj  8.0   NaN   NaN                     8.0  4.0   4.0   NaN
3  2001-12   bj  7.0   NaN   7.0                     NaN  2.0   NaN   2.0
4  2001-03   sh  4.0   4.0   NaN                     NaN  3.0   NaN   NaN
5  2001-06   sh  5.0   NaN   NaN                     5.0  5.0   5.0   NaN
6  2001-09   sh  9.0   NaN   NaN                     NaN  4.0   4.0   NaN
7  2001-12   sh  3.0   3.0   NaN                     NaN  6.0   NaN   6.0

I want to fill the NaNs in gdp with values from gdp1, gdp2 and gross domestic product, and the NaNs in pop with values from pop1 and pop2. The expected result:

      date city  gdp  pop
0  2001-03   bj    3    7
1  2001-06   bj    5    6
2  2001-09   bj    8    4
3  2001-12   bj    7    2
4  2001-03   sh    4    3
5  2001-06   sh    5    5
6  2001-09   sh    9    4
7  2001-12   sh    3    6

The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?

df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
# drop returns a new DataFrame, so assign the result back
df = df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)

jezrael · Accepted Answer · 2020-04-17 09:56:40Z

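The idea is to back-fill missing values across the columns selected by DataFrame.filter and then take the first column. If a row has more than one non-missing value, the leftmost column takes priority; changing .bfill(axis=1).iloc[:, 0] to .ffill(axis=1).iloc[:, -1] gives priority to the rightmost column instead. Note that filter(like='gdp') matches column names by substring, so it does not select gross domestic product: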

# if gdp / pop is the first (leftmost) matched column
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']

# if any column may come first: take the first matched column by position
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
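
To make the left-versus-right priority concrete, here is a minimal sketch on a hypothetical toy frame (the `demo` data is invented for illustration and is not the question's data). Row 1 holds two different candidate values, so the two variants disagree only there:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has two different non-missing candidates
demo = pd.DataFrame({
    "gdp":  [3.0, np.nan, np.nan],
    "gdp1": [np.nan, 4.0, np.nan],
    "gdp2": [np.nan, 9.0, 7.0],
})

# Back-fill along each row, then take the first column:
# this yields the leftmost non-missing value per row
left_first = demo.bfill(axis=1).iloc[:, 0]

# Forward-fill along each row, then take the last column:
# this yields the rightmost non-missing value per row
right_first = demo.ffill(axis=1).iloc[:, -1]

print(left_first.tolist())   # [3.0, 4.0, 7.0]
print(right_first.tolist())  # [3.0, 9.0, 7.0]
```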

But if each row can hold only one non-missing value (or the non-missing values are always equal), max or min can be used as well:

df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
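
The max/min shortcut is only safe under that assumption. A small sketch on hypothetical data (again invented for illustration) shows what happens when a row holds two different values: max and min then pick by size, not by column order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 has two different values (2.0 and 5.0)
demo = pd.DataFrame({
    "pop":  [7.0, np.nan, np.nan],
    "pop1": [np.nan, 6.0, 2.0],
    "pop2": [np.nan, 6.0, 5.0],
})

by_max = demo.max(axis=1)                 # largest value per row, NaNs skipped
by_min = demo.min(axis=1)                 # smallest value per row, NaNs skipped
by_left = demo.bfill(axis=1).iloc[:, 0]   # leftmost non-missing value per row

print(by_max.tolist())   # [7.0, 6.0, 5.0]
print(by_min.tolist())   # [7.0, 6.0, 2.0]
print(by_left.tolist())  # [7.0, 6.0, 2.0]
```

All three agree on rows 0 and 1; they differ on row 2 only because the candidates conflict.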

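If you need to specify the column names in a list, keep the existing gdp and pop values and fill only their gaps from the listed columns:

gdp_c = ['gdp1','gdp2','gross domestic product']
pop_c = ['pop1','pop2']
# first non-missing fallback value per row, in list order
df['gdp'] = df['gdp'].fillna(df[gdp_c].bfill(axis=1).iloc[:, 0])
df['pop'] = df['pop'].fillna(df[pop_c].bfill(axis=1).iloc[:, 0])

# keep only the final columns
df = df[['date','city','gdp','pop']]
print(df)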

      date city  gdp  pop
0  2001-03   bj  3.0  7.0
1  2001-06   bj  5.0  6.0
2  2001-09   bj  8.0  4.0
3  2001-12   bj  7.0  2.0
4  2001-03   sh  4.0  3.0
5  2001-06   sh  5.0  5.0
6  2001-09   sh  9.0  4.0
7  2001-12   sh  3.0  6.0
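
Since the question mentions many similar columns, the list-based variant could be wrapped in a loop over a mapping. The sketch below is one possible way to do that, not part of the original answer: `coalesce_columns` and `groups` are hypothetical names, and the small frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def coalesce_columns(df, groups):
    """Fill each target column from its fallback columns, in list order,
    then drop the fallbacks. `groups` maps target name -> list of fallbacks."""
    out = df.copy()
    for target, fallbacks in groups.items():
        # Target first, so existing values win; then the leftmost fallback
        out[target] = out[[target, *fallbacks]].bfill(axis=1).iloc[:, 0]
        out = out.drop(columns=fallbacks)
    return out

# Hypothetical toy data with gaps in the target columns
df = pd.DataFrame({
    "date": ["2001-03", "2001-06"],
    "city": ["bj", "sh"],
    "gdp": [3.0, np.nan],
    "gdp1": [np.nan, np.nan],
    "gross domestic product": [np.nan, 5.0],
    "pop": [np.nan, 5.0],
    "pop1": [7.0, 5.0],
})

groups = {"gdp": ["gdp1", "gross domestic product"], "pop": ["pop1"]}
result = coalesce_columns(df, groups)
print(result)
```

Putting the target first in the selection means existing values are never overwritten, and the order of each fallback list decides which column wins when several are filled.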

edited Apr 17, 2020 at 9:56

answered Apr 17, 2020 at 9:41


6 Comments

Vishnudev Krishnadas Over a year ago

Why not df.filter(like='gdp').bfill(axis=1)['gdp']?

ah bon Over a year ago

Sorry, filter is a good idea, you may keep it in the answer. But my data has gdp-type columns whose names share nothing with the word gdp. Please check the updated example data.

ah bon Over a year ago

So I need to set the names of the columns used to fill gdp and pop instead of using the filter method.

jezrael Over a year ago

@ahbon - I think the order is given by the list of column names, e.g. with gdp_c = ['gdp1','gdp2','gross domestic product'], selecting by that list puts gross domestic product last.

ah bon Over a year ago

You're right, it's not necessary to set the column order of df in the first step. :)
