
I have a dataset where, on occasion, a row of data ends up being split across two rows of a DataFrame. I am able to isolate these rows and would like to combine each pair into a single row; unfortunately this is not always straightforward, as the data is assigned upstream to the DB on a first-in basis.

Example df:

import pandas as pd
import numpy as np

d = {'id':[1,2,2,3,3],
     'a1':[1,1,np.nan,1,np.nan],
     'b1':[1,2,np.nan,2,np.nan],
     'c1':[1,3,np.nan,3,5],
     'c2':[1,4,np.nan,4,6],
     'c3':[1,np.nan,5,np.nan,7],
     'c4':[1,np.nan,6,np.nan,np.nan],
     'c5':[1,np.nan,7,np.nan,np.nan],
     'd1':[1,np.nan,8,np.nan,8]}

df = pd.DataFrame(d)

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   1   1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1   2   1.0 2.0 3.0 4.0 NaN NaN NaN NaN
2   2   NaN NaN NaN NaN 5.0 6.0 7.0 8.0
3   3   1.0 2.0 3.0 4.0 NaN NaN NaN NaN
4   3   NaN NaN 5.0 6.0 7.0 NaN NaN 8.0

For clarity, duplicates are on 'id', and the intention is to end up with a single row per ID within df.

Scenario 1 (ID = 2): the index 2 values 5, 6, 7, 8 would replace the NaN values in index 1, directly above.

Scenario 2 (ID = 3): the index 4 values 5, 6, 7 would need to move to columns c3, c4, c5 within index 3, and the value 8 (d1) would replace the NaN value in d1 at index 3.

This occurs multiple times within a large dataset, and the column at which the split happens varies, so I would ideally need to do this dynamically wherever a duplicate ID is found (one way to isolate the affected rows is sketched below).
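For reference, a minimal sketch of how the split rows can be isolated, assuming (as in the examples) that they are exactly the rows whose id occurs more than once:

split_rows = df[df['id'].duplicated(keep=False)]  # every row whose id appears at least twice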

EDIT: an additional scenario:

d = {'id':[1,4,4],
     'a1':[1,1,np.nan],
     'b1':[2,2,np.nan],
     'c1':[3,3,5],
     'c2':[4,4,6],
     'c3':[5,np.nan,np.nan],
     'c4':[6,np.nan,np.nan],
     'c5':[7,np.nan,np.nan],
     'd1':[8,np.nan,8]}

df2 = pd.DataFrame(d)

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   1   1.0 2.0 3   4   5.0 6.0 7.0 8.0
1   4   1.0 2.0 3   4   NaN NaN NaN NaN
2   4   NaN NaN 5   6   NaN NaN NaN 8.0

Scenario 3 (ID = 4): the same as Scenario 2, except that c5 (ID 4) would remain NaN in this case, since there are not enough values to fill every c column.
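
The desired single row for ID 4 would therefore be:

    id  a1  b1  c1  c2  c3  c4  c5  d1
0   4   1.0 2.0 3.0 4.0 5.0 6.0 NaN 8.0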

  • Do you always have the perfect number of values, and all the rest NaNs? Commented Feb 1, 2022 at 15:23
  • @mozway No, I don't always have the exact number of values. You typically get at least one value for each alpha column, e.g. [a1, b1, c1, d1], and then any number of values (in this case for c). It is typical to have only a handful of rows in a DF with values for all columns, and then any number of rows with values between c1 and c(n). Commented Feb 1, 2022 at 15:38
  • Can you update your example to reflect this, and provide details on which column has which role? Commented Feb 1, 2022 at 15:38
  • How do you decide which columns should be unshifted? Will it always be d1, or is it based on a pattern? Can it vary among groups? Commented Feb 1, 2022 at 16:04
  • Check the updated answer ;) Commented Feb 1, 2022 at 19:13

1 Answer


ideal number of values to fill the NaNs

Assuming you always have the ideal number of values, you can flatten per group without NaNs:

(df.set_index('id')
   .groupby(level=0)  # one group per id
   # flatten each group row by row, drop the NaNs,
   # then relabel the values with the original column names
   .apply(lambda d: d.stack().dropna().set_axis(df.columns[1:]))
   .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1   2  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2   3  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
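
For intuition, here is the intermediate result for the shifted group (id 3), before set_axis relabels it. stack walks each row left to right and the rows top to bottom, so after dropna the eight values already come out in the right order, even though the stacked labels repeat:

df.set_index('id').loc[[3]].stack().dropna()

id
3  a1    1.0
   b1    2.0
   c1    3.0
   c2    4.0
   c1    5.0
   c2    6.0
   c3    7.0
   d1    8.0
dtype: float64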
alternative for an arbitrary number of "id"/grouping columns:

id_cols = ['id']

(df.set_index(id_cols)
   .groupby(level=range(len(id_cols)))
   .apply(lambda d: d.stack().dropna().set_axis(list(df.drop(columns=id_cols))))
   .reset_index()
)
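
For instance, with a hypothetical second key column (both 'site' and the frame df_multi below are assumptions for illustration, not from the data above), only id_cols changes:

id_cols = ['id', 'site']  # 'site' is a hypothetical second grouping column

(df_multi.set_index(id_cols)
   .groupby(level=range(len(id_cols)))
   .apply(lambda d: d.stack().dropna().set_axis(list(df_multi.drop(columns=id_cols))))
   .reset_index()
)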

filling values per group of columns

In this alternative, the columns are grouped per prefix Xn, where X is the non-digit prefix (the column group) and n the sublevel within that group.

Here I used a double groupby, once on the rows, once on the columns:

def flat(d):
    # group the columns by their non-digit prefix ('a', 'b', 'c', 'd');
    # per column group, flatten the values, push the NaNs to the end
    # (sorted is stable, so the non-NaN order is preserved), and keep
    # as many values as the group has columns
    return (d.groupby(d.columns.str.extract(r'(^\D+)', expand=False), axis=1)
             .apply(lambda g: pd.DataFrame([sorted(g.values.ravel(),
                                                   key=pd.isna)[:len(g.columns)]],
                                           columns=g.columns)
                    )).iloc[0]

(df2.set_index('id')
    .groupby(level=0)
    .apply(flat)
    .reset_index()
)

output:

   id   a1   b1   c1   c2   c3   c4   c5   d1
0   1  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0
1   4  1.0  2.0  3.0  4.0  5.0  6.0  NaN  8.0
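
Note: column-wise groupby (axis=1) is deprecated in recent pandas versions. If that is a concern, here is a sketch of the same logic that buckets the column labels directly instead (the helper name flat_compat is introduced here for illustration; it assumes the same Xn column naming):

def flat_compat(d):
    # bucket the column labels by their non-digit prefix, then per bucket
    # flatten the values, push the NaNs last, and keep one value per column
    prefixes = d.columns.str.extract(r'(^\D+)', expand=False)
    out = {}
    for prefix in prefixes.unique():
        cols = d.columns[prefixes == prefix]
        vals = sorted(d[cols].values.ravel(), key=pd.isna)[:len(cols)]
        out.update(zip(cols, vals))
    return pd.Series(out)

df2.set_index('id').groupby(level=0).apply(flat_compat).reset_index()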