
I have a dataset where, on occasion, a row of data ends up being split across two rows of a DataFrame. I am able to isolate these rows and would like to combine each pair into a single row; unfortunately this is not always straightforward, as the data is assigned upstream to the DB on a first-in basis.

Example df:

import pandas as pd
import numpy as np

d = {'id':[1,2,2,3,3],
     'a1':[1,1,np.nan,1,np.nan],
     'b1':[1,2,np.nan,2,np.nan],
     'c1':[1,3,np.nan,3,5],
     'c2':[1,4,np.nan,4,6],
     'c3':[1,np.nan,5,np.nan,7],
     'c4':[1,np.nan,6,np.nan,np.nan],
     'c5':[1,np.nan,7,np.nan,np.nan],
     'd1':[1,np.nan,8,np.nan,8]}

df = pd.DataFrame(d)

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   1   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1   2   1.0 2.0 3.0 4.0 NaN NaN NaN NaN
2   2   NaN NaN NaN NaN 5.0 6.0 7.0 8.0
3   3   1.0 2.0 3.0 4.0 NaN NaN NaN NaN
4   3   NaN NaN 5.0 6.0 7.0 NaN NaN 8.0

For clarity, duplicates are on 'id', and the intention is to end up with a single row per ID within df.

Scenario 1 (ID = 2): the index 2 values 5, 6, 7, 8 would replace the NaN values in index 1, directly above.

Scenario 2 (ID = 3): the index 4 values 5, 6, 7 would need to move to columns c3, c4, c5 within index 3, and the value 8 (d1) would replace the NaN value in d1 at index 3.

This occurs multiple times within a large dataset, and the column at which the split happens varies, so I would ideally need to do this dynamically wherever a duplicate ID is found (one way to isolate the affected rows is sketched below).
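For reference, a minimal sketch of how the split rows can be isolated, assuming (as in the examples) that they are exactly the rows whose id occurs more than once:

split_rows = df[df['id'].duplicated(keep=False)]  # every row whose id appears at least twice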

EDIT: an additional scenario:

d = {'id':[1,4,4],
     'a1':[1,1,np.nan],
     'b1':[2,2,np.nan],
     'c1':[3,3,5],
     'c2':[4,4,6],
     'c3':[5,np.nan,np.nan],
     'c4':[6,np.nan,np.nan],
     'c5':[7,np.nan,np.nan],
     'd1':[8,np.nan,8]}

df2 = pd.DataFrame(d)

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   1   1.0 2.0 3   4   5.0 6.0 7.0 8.0
1   4   1.0 2.0 3   4   NaN NaN NaN NaN
2   4   NaN NaN 5   6   NaN NaN NaN 8.0

Scenario 3 (ID = 4): the same as Scenario 2, except that c5 (ID 4) would remain NaN in this case, since there are not enough values to fill every c column.
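
The desired single row for ID 4 would therefore be:

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   4   1.0 2.0 3.0 4.0 5.0 6.0 NaN 8.0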

  • Do you always have the perfect number of values, and all the rest NaNs? Commented Feb 1, 2022 at 15:23
  • @mozway No, I don't always have the exact number of values. You typically get at least one value for each alpha column, e.g. [a1, b1, c1, d1], and then any number of values (in this case for c). It is typical to have only a handful of rows in a DF with values for all columns, and then any number of rows with values between c1 and c(n). Commented Feb 1, 2022 at 15:38
  • Can you update your example to reflect this, and provide details on which column has which role? Commented Feb 1, 2022 at 15:38
  • How do you decide which columns should be unshifted? Will it always be d1, or is it based on a pattern? Can it vary among groups? Commented Feb 1, 2022 at 16:04
  • Check the updated answer ;) Commented Feb 1, 2022 at 19:13

1 Answer


ideal number of values to fill the NaNs

Assuming you always have the ideal number of values, you can flatten per group without NaNs:

(df.set_index('id')
   .groupby(level=0)  # one group per id
   # flatten each group row by row, drop the NaNs,
   # then relabel the values with the original column names
   .apply(lambda d: d.stack().dropna().set_axis(df.columns[1:]))
   .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2   3  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
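
For intuition, here is the intermediate result for the shifted group (id 3), before set_axis relabels it. stack walks each row left to right and the rows top to bottom, so after dropna the eight values already come out in the right order, even though the stacked labels repeat:

df.set_index('id').loc[[3]].stack().dropna()

id
3  a1    1.0
   b1    2.0
   c1    3.0
   c2    4.0
   c1    5.0
   c2    6.0
   c3    7.0
   d1    8.0
dtype: float64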
alternative for an arbitrary number of "id"/grouping columns:

id_cols = ['id']

(df.set_index(id_cols)
   .groupby(level=range(len(id_cols)))
   .apply(lambda d: d.stack().dropna().set_axis(list(df.drop(columns=id_cols))))
   .reset_index()
)
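
For instance, with a hypothetical second key column (both 'site' and the frame df_multi below are assumptions for illustration, not from the data above), only id_cols changes:

id_cols = ['id', 'site']  # 'site' is a hypothetical second grouping column

(df_multi.set_index(id_cols)
   .groupby(level=range(len(id_cols)))
   .apply(lambda d: d.stack().dropna().set_axis(list(df_multi.drop(columns=id_cols))))
   .reset_index()
)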

filling values per group of columns

In this alternative, the columns are grouped per prefix Xn, where X is the non-digit prefix (the column group) and n the sublevel within that group.

Here I used a double groupby, once on the rows, once on the columns:

def flat(d):
    # group the columns by their non-digit prefix ('a', 'b', 'c', 'd');
    # per column group, flatten the values, push the NaNs to the end
    # (sorted is stable, so the non-NaN order is preserved), and keep
    # as many values as the group has columns
    return (d.groupby(d.columns.str.extract(r'(^\D+)', expand=False), axis=1)
             .apply(lambda g: pd.DataFrame([sorted(g.values.ravel(),
                                                   key=pd.isna)[:len(g.columns)]],
                                           columns=g.columns)
                    )).iloc[0]

(df2.set_index('id')
    .groupby(level=0)
    .apply(flat)
    .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
1   4  1.0  2.0  3.0  4.0  5.0  6.0  NaN  8.0
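
Note: column-wise groupby (axis=1) is deprecated in recent pandas versions. If that is a concern, here is a sketch of the same logic that buckets the column labels directly instead (the helper name flat_compat is introduced here for illustration; it assumes the same Xn column naming):

def flat_compat(d):
    # bucket the column labels by their non-digit prefix, then per bucket
    # flatten the values, push the NaNs last, and keep one value per column
    prefixes = d.columns.str.extract(r'(^\D+)', expand=False)
    out = {}
    for prefix in prefixes.unique():
        cols = d.columns[prefixes == prefix]
        vals = sorted(d[cols].values.ravel(), key=pd.isna)[:len(cols)]
        out.update(zip(cols, vals))
    return pd.Series(out)

df2.set_index('id').groupby(level=0).apply(flat_compat).reset_index()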