3

I have the following dataframe, which looks like the below:

import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A']})


    code    fruit     subcode
  0 100     berries   100A
  1 100     berries   100B
  2 100     berries   100C
  3 200     tropical  200A
  4 200     tropical  200B
  5 300     tropical  300A
  6 400     berries   400A
  7 500     nuts      500A

I want to transform the dataframe to this format:

    code    fruit     subcode1 subcode2 subcode3
  0 100     berries   100A     100B     100C
  3 200     tropical  200A     200B
  5 300     tropical  300A
  6 400     berries   400A
  7 500     nuts      500A

Unfortunately, I'm stuck as to how to proceed. I've consulted posts like "Unmelt Pandas DataFrame" and have tried combinations of stack and unstack. I suspect that some concatenation is involved, too. I would appreciate any advice to help point me in the right direction!


3 Answers

4

You can use groupby, take the values and convert them to series.

(df.groupby(['code', 'fruit'])['subcode']
   .apply(lambda x: x.values)   # one array of subcodes per group
   .apply(pd.Series)            # spread each array across columns
   .add_prefix('subcode_'))

                subcode_0 subcode_1 subcode_2
code fruit                                 
100  berries       100A      100B      100C
200  tropical      200A      200B       NaN
300  tropical      300A       NaN       NaN
400  berries       400A       NaN       NaN
500  nuts          500A       NaN       NaN

5 Comments

I like this approach, but I dislike the apply(Series). Good effort though!
I do agree, it consumes a lot of time.
Is there any difference between applying ravel versus list?
@ALollz I just realized that's unnecessary.
Thanks so much! This totally works on my data set, and I learned that there is an .add_prefix/.add_suffix for row/column labels.
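
Regarding the concern above about apply(pd.Series) being slow: one possible alternative, offered only as a minimal sketch and not as the answerer's method, is to aggregate each group into a plain list and hand all the lists to the DataFrame constructor at once. The names `grouped` and `wide` are made up for illustration, and no timing is claimed here.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A']})

# Collect the subcodes of each (code, fruit) group into a Python list.
grouped = df.groupby(['code', 'fruit'])['subcode'].agg(list)

# Build the wide frame in a single constructor call; shorter lists are
# padded with missing values. Columns 0, 1, 2 become subcode_0, ...
wide = pd.DataFrame(grouped.tolist(), index=grouped.index).add_prefix('subcode_')
print(wide)
```

The result keeps the (code, fruit) MultiIndex, as in the answer above; chaining .reset_index() would turn it back into ordinary columns.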
4

Play around a bit with set_index and unstack, and you'll get it.

(df.set_index(['code', 'fruit'])
   .set_index(df.subcode.str.extract('([a-zA-Z]+)', expand=False), append=True)
   .subcode
   .unstack()
   .fillna('')                  # these last three 
   .reset_index()               # operations are  
   .rename_axis(None, axis=1)   # not important
)

   code     fruit     A     B     C
0   100   berries  100A  100B  100C
1   200  tropical  200A  200B      
2   300  tropical  300A            
3   400   berries  400A            
4   500      nuts  500A            
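
The code above takes its column labels from the letter at the end of each subcode. If the subcodes did not carry such a suffix, a similar set_index/unstack chain could number the rows within each group instead. This is a sketch under that assumption, not part of the original answer; the helper column `n` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A']})

wide = (
    # Number the rows inside each (code, fruit) group: 1, 2, 3, ...
    df.assign(n=df.groupby(['code', 'fruit']).cumcount() + 1)
      .set_index(['code', 'fruit', 'n'])['subcode']
      .unstack()                    # the innermost level 'n' becomes the columns
      .add_prefix('subcode')        # 1, 2, 3 -> subcode1, subcode2, subcode3
      .rename_axis(None, axis=1)    # drop the leftover 'n' label on the columns
      .reset_index()
)
print(wide)
```

Missing positions are left as NaN here; a trailing .fillna('') would blank them out as in the output above.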


3

With defaultdict

from collections import defaultdict


d = defaultdict(list)

for f, c, s in df.itertuples(index=False):
    d[(f, c)].append(s)

pd.DataFrame.from_dict(
    {k: dict(enumerate(v)) for k, v in d.items()}, orient='index'
).add_prefix('subcode').rename_axis(['fruit', 'code']).reset_index()

      fruit  code subcode0 subcode1 subcode2
0   berries   100     100A     100B     100C
1   berries   400     400A      NaN      NaN
2      nuts   500     500A      NaN      NaN
3  tropical   200     200A     200B      NaN
4  tropical   300     300A      NaN      NaN

1 Comment

Thanks! I'll have to read up on defaultdict to see how it works. Definitely appreciate learning different approaches.
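
For anyone else reading up on defaultdict: the grouping step in the answer above can be tried on its own, without pandas. In this small sketch the `rows` list is a hand-written stand-in for the dataframe rows, and the variable names are illustrative only.

```python
from collections import defaultdict

# Hand-written stand-in for the (fruit, code, subcode) rows of the dataframe.
rows = [
    ('berries', 100, '100A'), ('berries', 100, '100B'), ('berries', 100, '100C'),
    ('tropical', 200, '200A'), ('tropical', 200, '200B'), ('tropical', 300, '300A'),
    ('berries', 400, '400A'), ('nuts', 500, '500A'),
]

# defaultdict(list) creates an empty list the first time a key is accessed,
# so there is no need to check whether the key already exists before appending.
d = defaultdict(list)
for fruit, code, subcode in rows:
    d[(fruit, code)].append(subcode)

for key, subcodes in d.items():
    print(key, subcodes)
```

Each key ends up holding the subcodes of one (fruit, code) group in the order they appeared, which is what the dict(enumerate(v)) step in the answer then turns into numbered columns.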
