I have a dataset where, on occasion, a single record ends up split across two rows of a DataFrame. I can isolate these rows and would like to combine each pair into one. Unfortunately it is not always straightforward, as the data is assigned upstream to the DB on a first-in basis.
Example df:
import pandas as pd
import numpy as np

d = {'id': [1, 2, 2, 3, 3],
     'a1': [1, 1, np.nan, 1, np.nan],
     'b1': [1, 2, np.nan, 2, np.nan],
     'c1': [1, 3, np.nan, 3, 5],
     'c2': [1, 4, np.nan, 4, 6],
     'c3': [1, np.nan, 5, np.nan, 7],
     'c4': [1, np.nan, 6, np.nan, np.nan],
     'c5': [1, np.nan, 7, np.nan, np.nan],
     'd1': [1, np.nan, 8, np.nan, 8]}
df = pd.DataFrame(d)
id a1 b1 c1 c2 c3 c4 c5 d1
0 1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 2 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
2 2 NaN NaN NaN NaN 5.0 6.0 7.0 8.0
3 3 1.0 2.0 3.0 4.0 NaN NaN NaN NaN
4 3 NaN NaN 5.0 6.0 7.0 NaN NaN 8.0
For clarity, duplicates are on 'id', and the intention is to end up with a single row per id within df.
Scenario 1 (id = 2): the values 5, 6, 7, 8 in index 2 would replace the NaN values in index 1 directly above.
Scenario 2 (id = 3): the values 5, 6, 7 in index 4 would need to move to columns c3, c4, c5 of index 3, and the value 8 (d1) would replace the NaN in d1 of index 3.
This occurs many times within a large dataset, and the column at which the split happens varies, so I would ideally need to handle it dynamically wherever a duplicate id is found.
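For reference, a plain per-column fill such as `groupby('id').first()` (which keeps the first non-NaN value in each column) collapses scenario 1 correctly, but it gets scenario 2 wrong because it never shifts values across columns: for id 3 the values 5 and 6 are discarded and 7 lands in c3.

```python
import pandas as pd
import numpy as np

d = {'id': [1, 2, 2, 3, 3],
     'a1': [1, 1, np.nan, 1, np.nan],
     'b1': [1, 2, np.nan, 2, np.nan],
     'c1': [1, 3, np.nan, 3, 5],
     'c2': [1, 4, np.nan, 4, 6],
     'c3': [1, np.nan, 5, np.nan, 7],
     'c4': [1, np.nan, 6, np.nan, np.nan],
     'c5': [1, np.nan, 7, np.nan, np.nan],
     'd1': [1, np.nan, 8, np.nan, 8]}
df = pd.DataFrame(d)

# first() keeps the first non-NaN value per column within each id,
# so it only works when the split is purely positional (scenario 1)
naive = df.groupby('id', as_index=False).first()
print(naive)
```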
EDIT: additional scenario
d = {'id': [1, 4, 4],
     'a1': [1, 1, np.nan],
     'b1': [2, 2, np.nan],
     'c1': [3, 3, 5],
     'c2': [4, 4, 6],
     'c3': [5, np.nan, np.nan],
     'c4': [6, np.nan, np.nan],
     'c5': [7, np.nan, np.nan],
     'd1': [8, np.nan, 8]}
df2 = pd.DataFrame(d)
id a1 b1 c1 c2 c3 c4 c5 d1
0 1 1.0 2.0 3 4 5.0 6.0 7.0 8.0
1 4 1.0 2.0 3 4 NaN NaN NaN NaN
2 4 NaN NaN 5 6 NaN NaN NaN 8.0
Scenario 3 (id = 4): the same as scenario 2, except that c5 (id 4) would remain NaN in this case.
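In case it helps frame an answer, here is a sketch (not a tested solution) of the behaviour I am after. It assumes two things on my part: that every value other than d1 belongs to one ordered sequence a1..c5, and that a continuation row always carries its overflow in its leftmost columns. Under those assumptions, all three scenarios reduce to re-packing the non-NaN sequence values of each duplicate group left to right, with d1 filled in place:

```python
import pandas as pd
import numpy as np

SEQ_COLS = ['a1', 'b1', 'c1', 'c2', 'c3', 'c4', 'c5']  # one ordered sequence
DIRECT_COLS = ['d1']                                    # filled in place

def collapse(group):
    """Merge all rows of one id: re-pack sequence values, fill d1 in place."""
    out = group.iloc[0].copy()
    # every non-NaN sequence value, row by row, in column order
    vals = [v for _, row in group.iterrows() for v in row[SEQ_COLS].dropna()]
    vals = vals[:len(SEQ_COLS)]                         # guard against overflow
    out[SEQ_COLS] = vals + [np.nan] * (len(SEQ_COLS) - len(vals))
    for c in DIRECT_COLS:
        s = group[c].dropna()
        out[c] = s.iloc[0] if len(s) else np.nan
    return out

d = {'id': [1, 2, 2, 3, 3],
     'a1': [1, 1, np.nan, 1, np.nan],
     'b1': [1, 2, np.nan, 2, np.nan],
     'c1': [1, 3, np.nan, 3, 5],
     'c2': [1, 4, np.nan, 4, 6],
     'c3': [1, np.nan, 5, np.nan, 7],
     'c4': [1, np.nan, 6, np.nan, np.nan],
     'c5': [1, np.nan, 7, np.nan, np.nan],
     'd1': [1, np.nan, 8, np.nan, 8]}
df = pd.DataFrame(d)

d2 = {'id': [1, 4, 4],
      'a1': [1, 1, np.nan],
      'b1': [2, 2, np.nan],
      'c1': [3, 3, 5],
      'c2': [4, 4, 6],
      'c3': [5, np.nan, np.nan],
      'c4': [6, np.nan, np.nan],
      'c5': [7, np.nan, np.nan],
      'd1': [8, np.nan, 8]}
df2 = pd.DataFrame(d2)

# iterate the groups directly rather than using groupby.apply, so the
# 'id' column stays in each sub-frame regardless of pandas version
merged = pd.DataFrame([collapse(g) for _, g in df.groupby('id', sort=False)]
                      ).reset_index(drop=True)
merged2 = pd.DataFrame([collapse(g) for _, g in df2.groupby('id', sort=False)]
                       ).reset_index(drop=True)
print(merged)
print(merged2)
```

`SEQ_COLS`, `DIRECT_COLS`, and `collapse` are names I made up for illustration; in the real dataset the split between sequential and in-place columns would have to come from the schema.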