0

hello i am working with a dataset in .csv with python and am having errors grouping the columns. the code i'm working with is:

import pandas as pd 
df=pd.read_excel('filepath')
df['Items'].str.split(',', expand=True)
df=df.groupby(['Items0', 'Items1','Items2', 'Items3', 'Items4', 'Items5' ]).size()
print(df)

when i run the print(df) i get values such as Items0-1, Items1-1, Items2-1 and so on

this is the sample data i am working with and how i am trying to organize it is below.

can someone direct me to how to solve this?

sample data:

Name Date Items
johnny smith 09/1/2021 bread, oranges, peanut butter, apples, celery, peanuts
granny smith 08/31/2021 oranges, peanut butter, apples, bread
jane doe 09/01/2021 oranges, apples, celery, peanut butter
jack frost 08/01/2021 bread, oranges, apples
cinderella 08/16/2021 apples, peanuts, bread

what i am attempting to achieve:

Name Date Items0 Items1 Items2 Items3 Items4 Items5
johnny smith 09/1/2021 bread oranges peanut butter apples celery peanuts
granny smith 08/31/2021 bread oranges peanut butter apples
jane doe 09/01/2021 oranges peanut butter apples
jack frost 08/01/2021 bread oranges apples
cinderella 08/16/2021 bread apples peanuts

1 Answer 1

1

One approach would be to derive categorical values from the string in Items then reformat the DataFrame and join back together:

x = df.pop('Items').str.get_dummies(', ')
df = df.join(
    x.mul(x.columns).set_axis(
        range(len(x.columns)), axis=1
    ).add_prefix('Item')
)
           Name        Date   Item0  Item1   Item2    Item3          Item4    Item5
0  johnny smith   09/1/2021  apples  bread  celery  oranges  peanut butter  peanuts
1  granny smith  08/31/2021  apples  bread          oranges  peanut butter         
2      jane doe  09/01/2021  apples         celery  oranges  peanut butter         
3    jack frost  08/01/2021  apples  bread          oranges                        
4    cinderella  08/16/2021  apples  bread                                  peanuts

Explanations: use str.get_dummies to convert the Item column to categorical values:

x = df.pop('Items').str.get_dummies(', ')
   apples  bread  celery  oranges  peanut butter  peanuts
0       1      1       1        1              1        1
1       1      1       0        1              1        0
2       1      0       1        1              1        0
3       1      1       0        1              0        0
4       1      1       0        0              0        1

This can be converted into the desired format by mul by the column names (replacing the 1 values with the column name), set_axis to enumerate columns (0-number of items) and add_prefix of 'Item' to the newly numbered columns:

x.mul(x.columns).set_axis(
    range(len(x.columns)), axis=1
).add_prefix('Item')

x:

    Item0  Item1   Item2    Item3          Item4    Item5
0  apples  bread  celery  oranges  peanut butter  peanuts
1  apples  bread          oranges  peanut butter         
2  apples         celery  oranges  peanut butter         
3  apples  bread          oranges                        
4  apples  bread                                  peanuts

Then join back to the remaining columns in df .

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.