Data Manipulation in pandas - Creating multiple columns based on multiple conditions

Question

I have a dataframe that has only one column 'Desc' and it looks like this.

Desc
AB - 01
123 AB NEXT - 01
010 EMPLOYEE - 23
020 DEMAND 80
010 EMPLOYEE 45
020 DEMAND 28
AAAAA............
BBBBB.............
AB - 02
123 AB NEXT - 02
010 EMPLOYEE - 48
020 DEMAND - 87
010 EMPLOYEE - 94
020 DEMAND - 09
050 EMPLOYEE - 88
060 DEMAND - 90
BBBBBB..........
GGGGGG..........
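
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample above as a one-column DataFrame. The variable names `rows` and `starts` are only illustrative, and the real data is assumed to follow the same layout with a default integer index.

```python
import pandas as pd

# Rows copied verbatim from the sample above (trailing dots included)
rows = [
    "AB - 01",
    "123 AB NEXT - 01",
    "010 EMPLOYEE - 23",
    "020 DEMAND 80",
    "010 EMPLOYEE 45",
    "020 DEMAND 28",
    "AAAAA............",
    "BBBBB.............",
    "AB - 02",
    "123 AB NEXT - 02",
    "010 EMPLOYEE - 48",
    "020 DEMAND - 87",
    "010 EMPLOYEE - 94",
    "020 DEMAND - 09",
    "050 EMPLOYEE - 88",
    "060 DEMAND - 90",
    "BBBBBB..........",
    "GGGGGG..........",
]
df = pd.DataFrame({"Desc": rows})

# Positions of the rows that open a new "AB" block; str.match anchors at the
# start of the string, so "123 AB NEXT - 01" is not picked up
starts = df.index[df["Desc"].str.match(r"AB - \d{2}")].tolist()
print(starts)  # [0, 8]
```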

I want to manipulate data in a way that produces the following output.

Col 1	Col 2	Col 3	Col 4	Col 5	Col 6
AB - 01	123 AB NEXT -01	010 EMPLOYEE 23	020 DEMAND 80	NULL	NULL
AB - 01	123 AB NEXT -01	010 EMPLOYEE 45	020 DEMAND 28	NULL	NULL
AB - 02	123 AB NEXT -02	010 EMPLOYEE 48	020 DEMAND 87	050 EMPLOYEE 88	060 DEMAND - 90
AB - 02	123 AB NEXT -02	010 EMPLOYEE 94	020 DEMAND 09	NULL	NULL

So basically, each AB is a broad category that has columns 010, 020, and so on. I was thinking of using this approach: the code looks for a row starting with AB, creates a column for it, and then parses the data under it by placing each 010, 020 (all such numbers) in its own column until it encounters the next AB. Moreover, this is just an extract of the dataframe; it goes on with different ABs.

not_speshal · Accepted Answer · 2021-11-22 19:37:09Z


Use numpy.array_split to split the DataFrame at each new "AB" row, then for each frame in the resulting list:

split each row at the first space to get the column number (010, 020, etc.)
group by the column number and use cumcount to get a unique index (row count) within each number
pivot the data so that each column number becomes its own column
append the result to the output:

import numpy as np
import pandas as pd

#split positions are the rows that start a new "AB" block (assumes a default RangeIndex)
groups = np.array_split(df, df[df["Desc"].str.contains(r"AB - \d{2}", regex=True)].index)

frames = []
for group in groups:
    srs = group.squeeze()
    if srs.shape[0]==0:
        continue
    #first two rows are copied for all records
    split_cols = srs[2:].str.split(n=1, expand=True)
    #keep the full text only for rows whose first token is a number (010, 020, ...)
    split_cols[1] = srs.where(split_cols[0].str.isnumeric())
    split_cols = split_cols.dropna()
    #running count per column number = output row within this block
    split_cols["idx"] = split_cols.groupby(0).cumcount()
    
    temp = split_cols.pivot(index="idx", columns=0, values=1)
    temp.insert(0, "Col 1", srs.iat[0])
    temp.insert(1, "Col 2", srs.iat[1])
    frames.append(temp)

#DataFrame.append was removed in pandas 2.0, so concatenate once at the end
output = pd.concat(frames, ignore_index=True)

#rename columns if needed
output.columns = [f"Col {i+1}" for i in range(len(output.columns))]

>>> output

     Col 1             Col 2              Col 3            Col 4              Col 5  \
0  AB - 01  123 AB NEXT - 01  010 EMPLOYEE - 23    020 DEMAND 80                NaN   
1  AB - 01  123 AB NEXT - 01    010 EMPLOYEE 45    020 DEMAND 28                NaN   
2  AB - 02  123 AB NEXT - 02  010 EMPLOYEE - 48  020 DEMAND - 87  050 EMPLOYEE - 88   
3  AB - 02  123 AB NEXT - 02  010 EMPLOYEE - 94  020 DEMAND - 09                NaN   

             Col 6  
0              NaN  
1              NaN  
2  060 DEMAND - 90  
3              NaN
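
As a possible variation on the same idea, the split can also be done without numpy by labelling each block with a cumulative sum over the "AB" header rows and grouping on that label. This is only a sketch, not part of the original answer: `reshape_blocks` is a hypothetical helper name, and it assumes every block has exactly two header rows, that at least one "AB" row exists, and that there are no empty or missing values in `Desc`.

```python
import pandas as pd

def reshape_blocks(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one output row per repeat of the numeric codes in each AB block."""
    desc = df["Desc"]
    # The label increases by 1 at every row that starts with "AB - NN"
    block_id = desc.str.match(r"AB - \d{2}").cumsum()
    frames = []
    for key, srs in desc.groupby(block_id):
        if key == 0:  # rows before the first "AB" header, if any
            continue
        body = srs.iloc[2:]                    # skip the two header rows
        code = body.str.split(n=1).str[0]      # first token: "010", "020", ...
        keep = code.str.isnumeric()            # drops filler rows such as "AAAAA...."
        body, code = body[keep], code[keep]
        wide = pd.DataFrame(
            {"code": code, "text": body, "idx": body.groupby(code).cumcount()}
        ).pivot(index="idx", columns="code", values="text")
        wide.insert(0, "Col 1", srs.iat[0])
        wide.insert(1, "Col 2", srs.iat[1])
        frames.append(wide)
    out = pd.concat(frames, ignore_index=True)
    out.columns = [f"Col {i + 1}" for i in range(out.shape[1])]
    return out

# Small self-contained sample in the same layout as the question
sample = pd.DataFrame({"Desc": [
    "AB - 01", "123 AB NEXT - 01",
    "010 EMPLOYEE - 23", "020 DEMAND 80", "010 EMPLOYEE 45", "020 DEMAND 28",
    "AAAAA............", "BBBBB.............",
    "AB - 02", "123 AB NEXT - 02",
    "010 EMPLOYEE - 48", "020 DEMAND - 87", "010 EMPLOYEE - 94", "020 DEMAND - 09",
    "050 EMPLOYEE - 88", "060 DEMAND - 90",
    "BBBBBB..........", "GGGGGG..........",
]})
result = reshape_blocks(sample)
print(result)
```

Grouping on a cumulative sum keeps everything inside pandas, which may matter if passing a DataFrame to numpy's splitting functions raises warnings in your pandas version.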


2 Comments

ABC Over a year ago

I am not able to understand the use of squeeze() here. What is it doing?

not_speshal Over a year ago

Changing the DataFrame to a Series. You just have one column of data so you can work with a Series.
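
To make the reply above concrete, here is a small sketch of what squeeze() does on a one-column frame. The names `one_col` and `single` are made up for the illustration.

```python
import pandas as pd

one_col = pd.DataFrame({"Desc": ["AB - 01", "123 AB NEXT - 01", "010 EMPLOYEE - 23"]})

# A frame with a single column collapses to a Series
as_series = one_col.squeeze()
print(type(one_col).__name__, "->", type(as_series).__name__)

# With exactly one row AND one column, squeeze() collapses all the way to a scalar
single = one_col.iloc[:1]
scalar = single.squeeze()
print(repr(scalar))

# Restricting the squeeze to the column axis keeps a Series even for a single row
still_series = single.squeeze(axis=1)
print(type(still_series).__name__)
```

If a block could ever contain just one row, `squeeze(axis=1)` or selecting the column directly with `group["Desc"]` would be the safer choice, since both always return a Series.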
