
Given a dataframe as follows:

      date city  gdp  gdp1  gdp2  gross domestic product  pop  pop1  pop2
0  2001-03   bj  3.0   NaN   NaN                     NaN  7.0   NaN   NaN
1  2001-06   bj  5.0   NaN   NaN                     NaN  6.0   6.0   NaN
2  2001-09   bj  8.0   NaN   NaN                     8.0  4.0   4.0   NaN
3  2001-12   bj  7.0   NaN   7.0                     NaN  2.0   NaN   2.0
4  2001-03   sh  4.0   4.0   NaN                     NaN  3.0   NaN   NaN
5  2001-06   sh  5.0   NaN   NaN                     5.0  5.0   5.0   NaN
6  2001-09   sh  9.0   NaN   NaN                     NaN  4.0   4.0   NaN
7  2001-12   sh  3.0   3.0   NaN                     NaN  6.0   NaN   6.0

I want to fill the NaNs in gdp with values from gdp1, gdp2 and gross domestic product, and the NaNs in pop with values from pop1 and pop2. The expected output:

      date city  gdp  pop
0  2001-03   bj    3    7
1  2001-06   bj    5    6
2  2001-09   bj    8    4
3  2001-12   bj    7    2
4  2001-03   sh    4    3
5  2001-06   sh    5    5
6  2001-09   sh    9    4
7  2001-12   sh    3    6

The following code works, but I wonder if it's possible to make it more concise, since I have many similar columns?

df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp1']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gdp2']
df.loc[df['gdp'].isnull(), 'gdp'] = df['gross domestic product']
df.loc[df['pop'].isnull(), 'pop'] = df['pop1']
df.loc[df['pop'].isnull(), 'pop'] = df['pop2']
# drop returns a new DataFrame, so assign the result back
df = df.drop(['gdp1', 'gdp2', 'gross domestic product', 'pop1', 'pop2'], axis=1)

1 Answer

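The idea is to select the related columns with DataFrame.filter and back-fill the missing values along each row. If a row can hold more than one non-missing value, .bfill(axis=1).iloc[:, 0] gives priority to the leftmost column; changing it to .ffill(axis=1).iloc[:, -1] gives priority to the rightmost column. Note that filter(like='gdp') only matches column names containing 'gdp', so it does not pick up gross domestic product: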

# if gdp / pop is the first (leftmost) of the matched columns, select it by name
df['gdp'] = df.filter(like='gdp').bfill(axis=1)['gdp']
df['pop'] = df.filter(like='pop').bfill(axis=1)['pop']

# if any column may come first, select the first one by position
df['gdp'] = df.filter(like='gdp').bfill(axis=1).iloc[:, 0]
df['pop'] = df.filter(like='pop').bfill(axis=1).iloc[:, 0]
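
To make the left/right priority concrete, here is a minimal sketch on a small made-up frame (the `toy` data and the variable names are assumptions for illustration, not part of the question's data). Row 0 deliberately has two candidate values, so the two variants disagree there:

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 has two competing values (gdp1 and gdp2).
toy = pd.DataFrame({
    "gdp":  [np.nan, 5.0, np.nan],
    "gdp1": [1.0, np.nan, np.nan],
    "gdp2": [2.0, np.nan, 9.0],
})

cols = toy.filter(like="gdp")

# bfill pulls values towards the left, so the first column ends up
# holding the leftmost non-missing value of each row.
left_first = cols.bfill(axis=1).iloc[:, 0]

# ffill pushes values towards the right, so the last column ends up
# holding the rightmost non-missing value of each row.
right_first = cols.ffill(axis=1).iloc[:, -1]

print(left_first.tolist())   # [1.0, 5.0, 9.0]
print(right_first.tolist())  # [2.0, 5.0, 9.0]
```

Rows with a single non-missing value give the same result either way; only rows like row 0 depend on the chosen direction.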

But if each row can have at most one non-missing value, max or min works too:

df['gdp'] = df.filter(like='gdp').max(axis=1)
df['pop'] = df.filter(like='pop').max(axis=1)
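
The max shortcut relies on that "at most one value per row" assumption. A small sketch, again on made-up data (the `toy` frame and the `conflicts` mask are illustrative names), shows how to check the assumption and what happens when it does not hold:

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 has two different candidate values (1.0 and 3.0).
toy = pd.DataFrame({
    "pop":  [np.nan, np.nan, 4.0],
    "pop1": [6.0, 1.0, np.nan],
    "pop2": [np.nan, 3.0, np.nan],
})

cols = toy.filter(like="pop")

# max skips NaN by default, so it returns the only value when there is one.
by_max = cols.max(axis=1)

# Leftmost non-missing value, for comparison.
by_bfill = cols.bfill(axis=1).iloc[:, 0]

# Rows holding more than one non-missing value: here max and bfill may differ.
conflicts = cols.notna().sum(axis=1) > 1

print(by_max.tolist())     # [6.0, 3.0, 4.0]
print(by_bfill.tolist())   # [6.0, 1.0, 4.0]
print(conflicts.tolist())  # [False, True, False]
```

If `conflicts.any()` is True for your data, max picks the largest value rather than a column by priority, so the bfill variant is probably the safer choice.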

If you need to specify the column names in a list:

# list the target column first so that its existing values take priority
gdp_c = ['gdp','gdp1','gdp2','gross domestic product']
pop_c = ['pop','pop1','pop2']
df['gdp'] = df[gdp_c].bfill(axis=1).iloc[:, 0]
df['pop'] = df[pop_c].bfill(axis=1).iloc[:, 0]

df = df[['date','city','gdp','pop']]
print(df)

      date city  gdp  pop
0  2001-03   bj  3.0  7.0
1  2001-06   bj  5.0  6.0
2  2001-09   bj  8.0  4.0
3  2001-12   bj  7.0  2.0
4  2001-03   sh  4.0  3.0
5  2001-06   sh  5.0  5.0
6  2001-09   sh  9.0  4.0
7  2001-12   sh  3.0  6.0
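
Since the question mentions many similar columns, the list-based variant could be wrapped in a loop over a dict. The following is only a sketch: `coalesce_columns`, `mapping` and the `toy` frame are hypothetical names and data introduced here, not something from pandas or from the question.

```python
import numpy as np
import pandas as pd

def coalesce_columns(df, mapping):
    """Hypothetical helper: for each target column, take the first
    non-missing value across [target] + fallbacks (left to right),
    then drop the fallback columns."""
    out = df.copy()
    for target, fallbacks in mapping.items():
        out[target] = out[[target] + fallbacks].bfill(axis=1).iloc[:, 0]
        out = out.drop(columns=fallbacks)
    return out

# Made-up data with gaps in the target columns.
toy = pd.DataFrame({
    "city": ["bj", "sh", "bj"],
    "gdp": [3.0, np.nan, np.nan],
    "gdp1": [np.nan, 4.0, np.nan],
    "gross domestic product": [np.nan, np.nan, 8.0],
    "pop": [np.nan, 3.0, np.nan],
    "pop1": [7.0, np.nan, np.nan],
    "pop2": [np.nan, np.nan, 2.0],
})

# target column -> fallback columns, in priority order
mapping = {
    "gdp": ["gdp1", "gross domestic product"],
    "pop": ["pop1", "pop2"],
}

result = coalesce_columns(toy, mapping)
print(result)
```

Spelling out the mapping also covers columns such as gross domestic product, whose name shares nothing with the target and which filter(like=...) would miss.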

6 Comments

Why not df.filter(like='gdp').bfill(axis=1)['gdp']?
Sorry, filter is a good idea, you may keep it in the answer. But my data has gdp-type columns whose names share nothing with the word gdp. Please check the updated example data.
So I need to set the column names used to fill gdp and pop explicitly instead of using the filter method.
@ahbon - I think the order follows the list of column names, e.g. with gdp_c = ['gdp1','gdp2','gross domestic product'], gross domestic product is selected last.
You're right, it is not necessary to set the column order of df in a first step. :)