I have a dataset produced by merging two sources so that each can fill in the other's missing values.

The problem is that some columns still have missing data, and I now want to fill those gaps with the non-missing values from their paired columns.

The merged dataset looks like this as input:

Name         State       ID       Number_x      Number_y       Op_x       Op_y
Johnson      AL          1        1             nan            1956       nan
Johnson      AL          1        nan           nan            1956       nan
Johnson      AL          2        1             nan            1999       nan
Johnson      AL          2        0             nan            1999       nan
Debra        AK          1A       0             nan            2000       nan
Debra        AK          1B       nan           20             nan        1997
Debra        AK          2        nan           10             nan        2009
Debra        AK          3        nan           1              nan        2008
.
.

The output I want is this:

Name         State       ID       Number_x      Number_y     Op_x       Op_y
Johnson      AL          1        1             1            1956       1956
Johnson      AL          2        1             1            1999       1999
Johnson      AL          2        0             0            1999       1999
Debra        AK          1A       0             0            2000       2000
Debra        AK          1B       20            20           1997       1997
Debra        AK          2        10            10           2009       2009
Debra        AK          3        1             1            2008       2008
.
.

In other words, every NaN should be replaced by the value from its paired column: Number_x with Number_y, and Op_x with Op_y.

One thing to note: rows that share an ID sometimes have different values, like Johnson with ID = 2, which has different Number values but the same Op values. I want to keep these rows because I need to investigate them further.

Also, if a row is missing both Number_x and Number_y, I want to drop it, like the Johnson row where both are NaN.

  • Why is one value 1 and the other 10 in the second-to-last row? Commented Jan 21, 2019 at 17:01
  • Sorry, correction made. Thank you. Commented Jan 21, 2019 at 17:02
  • Also, your output has duplicated columns. Are _x and _y meant to be the same? Commented Jan 21, 2019 at 17:04
  • What about df.loc[df.isnull().any(axis=1), :] = df.ffill()? Commented Jan 21, 2019 at 17:04

1 Answer

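You can group the columns by their prefix (the part before the underscore) using groupby with axis=1, take the first non-null value in each group, and then drop rows where Number or Op is still missing:

df.groupby(df.columns.str.split('_').str[0], axis=1).first().dropna(subset=['Number', 'Op'])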
   ID     Name  Number      Op State
0   1  Johnson     1.0  1956.0    AL
2   2  Johnson     1.0  1999.0    AL
3   2  Johnson     0.0  1999.0    AL
4  1A    Debra     0.0  2000.0    AK
5  1B    Debra    20.0  1997.0    AK
6   2    Debra    10.0  2009.0    AK
7   3    Debra     1.0  2008.0    AK
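
Newer pandas releases deprecate `axis=1` in `groupby`, and the one-liner above also collapses each `_x`/`_y` pair into a single column. As a rough sketch of an alternative that keeps both columns, as in the question's desired output, `Series.combine_first` can fill each pair from the other. The sample frame below is reconstructed from the question, and the variable names `out`, `base`, and `filled` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample input reconstructed from the question (illustrative only).
df = pd.DataFrame({
    "Name": ["Johnson"] * 4 + ["Debra"] * 4,
    "State": ["AL"] * 4 + ["AK"] * 4,
    "ID": ["1", "1", "2", "2", "1A", "1B", "2", "3"],
    "Number_x": [1, np.nan, 1, 0, 0, np.nan, np.nan, np.nan],
    "Number_y": [np.nan, np.nan, np.nan, np.nan, np.nan, 20, 10, 1],
    "Op_x": [1956, 1956, 1999, 1999, 2000, np.nan, np.nan, np.nan],
    "Op_y": [np.nan, np.nan, np.nan, np.nan, np.nan, 1997, 2009, 2008],
})

out = df.copy()
for base in ("Number", "Op"):
    # Take the _x value where present, otherwise fall back to _y.
    filled = out[f"{base}_x"].combine_first(out[f"{base}_y"])
    # Write the coalesced values back to both columns of the pair.
    out[f"{base}_x"] = filled
    out[f"{base}_y"] = filled

# Drop rows where Number was missing on both sides of the merge.
out = out.dropna(subset=["Number_x"])
print(out)
```

Rows that share an ID but differ in Number (Johnson, ID 2) are left untouched, since nothing here deduplicates on ID.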

3 Comments

You could add astype(int) for ['Number', 'Op'].
@pygo The original df should be float, which is why I kept it that way here. Also, this is only a partial df.
@W-B OK, nice solution, +1
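
For reference, the integer cast suggested in the comments might look like the sketch below. The small `res` frame is a hypothetical stand-in for part of the coalesced result, and the cast is only safe once no NaN remains in the columns being converted.

```python
import pandas as pd

# Hypothetical stand-in for part of the coalesced result shown in the answer.
res = pd.DataFrame({
    "ID": ["1", "2", "1A"],
    "Number": [1.0, 1.0, 0.0],
    "Op": [1956.0, 1999.0, 2000.0],
})

# Plain int casting raises if any NaN is left, so run it after dropna.
res = res.astype({"Number": int, "Op": int})

# If NaN may remain, pandas' nullable integer dtype is one alternative:
# res = res.astype({"Number": "Int64", "Op": "Int64"})
print(res.dtypes)
```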
