0

I have a dataframe that has only one column 'Desc' and it looks like this.

Desc
AB - 01
123 AB NEXT - 01
010 EMPLOYEE - 23
020 DEMAND 80
010 EMPLOYEE 45
020 DEMAND 28
AAAAA............
BBBBB.............
AB - 02
123 AB NEXT - 02
010 EMPLOYEE - 48
020 DEMAND - 87
010 EMPLOYEE - 94
020 DEMAND - 09
050 EMPLOYEE - 88
060 DEMAND - 90
BBBBBB..........
GGGGGG..........

I want to manipulate data in a way that produces the following output.

Col 1 Col 2 Col 3 Col 4 Col 5 Col 6
AB - 01 123 AB NEXT -01 010 EMPLOYEE 23 020 DEMAND 80 NULL NULL
AB - 01 123 AB NEXT -01 010 EMPLOYEE 45 020 DEMAND 28 NULL NULL
AB - 02 123 AB NEXT -02 010 EMPLOYEE 48 020 DEMAND 87 050 EMPLOYEE 88 060 DEMAND - 90
AB - 02 123 AB NEXT -02 010 EMPLOYEE 94 020 DEMAND 09 NULL NULL

So basically, each AB is a broad category which has columns 010, 020, and so on. I was thinking of using this approach. The code looks for row staring with AB, create a column for that and that parse the data under it by placing each 010, 020 (all such numbers) in it's separate column until it encounters the next AB. Moreover, this is just an extract of the dataframe as it goes on with different ABs.

0

1 Answer 1

1

Use numpy.array_split to split the DataFrame on new "AB" rows and for each frame in the resulting list:

  1. split the data at the first space to get the column number (010, 020, etc.)
  2. groupby the column number and cumcount to get a unique index (row count)
  3. pivot the data
  4. append to output:
import numpy as np

groups = np.array_split(df, df[df["Desc"].str.contains("AB - \d{2}", regex=True)].index)

output = pd.DataFrame()
for group in groups:
    srs = group.squeeze()
    if srs.shape[0]==0:
        continue
    #first two rows are copied for all records
    split_cols = srs[2:].str.split(n=1, expand=True)
    split_cols[1] = srs.where(split_cols[0].str.isnumeric())
    split_cols = split_cols.dropna()
    split_cols["idx"] = split_cols.groupby(0)[1].transform("cumcount")
    
    temp = split_cols.pivot("idx", 0, 1)
    temp.insert(0, "Col 1", srs.iat[0])
    temp.insert(1, "Col 2", srs.iat[0])
    output = output.append(temp, ignore_index=True)

#rename columns if needed
output.columns = [f"Col {i+1}" for i in range(len(output.columns))]

>>> output

     Col 1    Col 2              Col 3            Col 4              Col 5  \
0  AB - 01  AB - 01  010 EMPLOYEE - 23    020 DEMAND 80                NaN   
1  AB - 01  AB - 01    010 EMPLOYEE 45    020 DEMAND 28                NaN   
2  AB - 02  AB - 02  010 EMPLOYEE - 48  020 DEMAND - 87  050 EMPLOYEE - 88   
3  AB - 02  AB - 02  010 EMPLOYEE - 94  020 DEMAND - 09                NaN   

             Col 6  
0              NaN  
1              NaN  
2  060 DEMAND - 90  
3              NaN  
Sign up to request clarification or add additional context in comments.

2 Comments

I am not able to understand the use of squeeze() here. What is it doing?
Changing the DataFrame to a Series. You just have one column of data so you can work with a Series.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.