Problems while splitting pandas dataframe row?

Question

I have the following pandas dataframe:

In:

df

out:

         A    B        C                                             D
0  0938320  usa   amazon              orange: $ 8.00| pineapple: $2.00
1  0938320  usa  alibaba                  orange: $ 8.00| apple: $2.00
2  0938320  usa     ebay  mint: $ 8.00| watermelon: $2.00| mint: $2.00
...
n  0938320  usa   amazon                  pear: $ 8.00| bannana: $2.00

I would like to split by | and stack it into (*):

         A    B        C                  D
0  0938320  usa   amazon     orange: $ 8.00
1  0938320  usa   amazon   pineapple: $2.00
2  0938320  usa  alibaba     orange: $ 8.00
3  0938320  usa  alibaba       apple: $2.00
4  0938320  usa     ebay       mint: $ 8.00
5  0938320  usa     ebay  watermelon: $2.00
6  0938320  usa     ebay        mint: $2.00
7  0938320  usa   amazon       pear: $ 8.00
...
8  0938320  usa   amazon     bannana: $2.00

So, I tried the following:

In:

s = df2.D.str.split("|").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
del df2['D']
df.join(s)

out:

ValueError: Other Series must have a name

And:

b = pd.DataFrame(df2.D.str.split('|').tolist(), index=df2['A','B','C']).stack()
b = b.reset_index()[[0, 'D']] 
b.columns = ['A','B','C']
b

However, neither attempt works. How can I modify the last approach in order to get (*)? I guess my main problem is that I do not know how to pass all three columns in index=df2['A','B','C']).stack().

Nickil Maveli · Accepted Answer · 2016-12-20 14:59:21Z

You could first set the three columns A, B and C as the index of the DataFrame and then split the fourth column, D. Passing expand=True to str.split returns the pieces as a DataFrame, one column per piece.

In [55]: df
Out[55]: 
        A     B         C                                              D
0  938320   usa    amazon               orange: $ 8.00| pineapple: $2.00
1  938320   usa   alibaba                   orange: $ 8.00| apple: $2.00
2  938320   usa      ebay   mint: $ 8.00| watermelon: $2.00| mint: $2.00

In [56]: df_split = df.set_index(['A', 'B', 'C'])['D'].str.split('|', expand=True)

In [57]: df_split
Out[57]: 
                                    0                   1             2
A      B    C                                                          
938320  usa  amazon    orange: $ 8.00    pineapple: $2.00          None
             alibaba   orange: $ 8.00        apple: $2.00          None
             ebay        mint: $ 8.00   watermelon: $2.00   mint: $2.00

Then stack the pieces into a single column (stack drops NaNs by default) and turn the index levels back into columns using reset_index.

In [58]: df_split.stack().reset_index(level=[0,1,2], name='D').reset_index(drop=True)
Out[58]: 
        A     B         C                   D
0  938320   usa    amazon      orange: $ 8.00
1  938320   usa    amazon    pineapple: $2.00
2  938320   usa   alibaba      orange: $ 8.00
3  938320   usa   alibaba        apple: $2.00
4  938320   usa      ebay        mint: $ 8.00
5  938320   usa      ebay   watermelon: $2.00
6  938320   usa      ebay         mint: $2.00
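
For readers on a newer pandas, the same reshaping can also be sketched with DataFrame.explode, which is a different technique from the stack-based one above and assumes pandas 0.25 or later. The sample frame below is rebuilt by hand from the question, and the final str.strip is an addition of mine to remove the space left after each |.

```python
import pandas as pd

# Sample data rebuilt from the question (column A kept as strings).
df = pd.DataFrame({
    "A": ["0938320"] * 3,
    "B": ["usa"] * 3,
    "C": ["amazon", "alibaba", "ebay"],
    "D": [
        "orange: $ 8.00| pineapple: $2.00",
        "orange: $ 8.00| apple: $2.00",
        "mint: $ 8.00| watermelon: $2.00| mint: $2.00",
    ],
})

# Turn each D value into a list of pieces, then give every piece its own row.
result = (
    df.assign(D=df["D"].str.split("|"))
      .explode("D")
      .reset_index(drop=True)
)

# Remove the whitespace left around each piece by the "| " separator.
result["D"] = result["D"].str.strip()

print(result)
```

Columns A, B and C are repeated automatically for every piece, so no set_index step is needed here.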

lmo · Accepted Answer · 2016-12-20 17:02:27Z

Here is an alternative using join to combine the split data.

# split D and get it into long/stacked format
productsLong = pd.DataFrame({'products':
                df['D'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)})

# join the data together on the indices
df[['A', 'B', 'C']].join(productsLong)

Out[56]: 
        A    B        C            products
0  938320  usa   amazon      orange: $ 8.00
0  938320  usa   amazon    pineapple: $2.00
1  938320  usa  alibaba      orange: $ 8.00
1  938320  usa  alibaba        apple: $2.00
2  938320  usa     ebay        mint: $ 8.00
2  938320  usa     ebay   watermelon: $2.00
2  938320  usa     ebay         mint: $2.00
3  938320  usa   amazon        pear: $ 8.00
3  938320  usa   amazon      bannana: $2.00

Notes
The rename method was returning an error, so I cast the Series into a DataFrame in order to provide a column name. reset_index with level=1 and drop=True removes the inner index level created by stack, keeping the index of the original DataFrame (repeated as needed for the join operation).
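
The ValueError in the question ("Other Series must have a name") points at the same issue: join needs a name to use as the column label. A minimal sketch of the asker's first attempt with the Series named directly, assuming a reasonably recent pandas and the sample data from the question; the extra dropna is a precaution of mine in case stack keeps the empty cells:

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "A": ["0938320"] * 3,
    "B": ["usa"] * 3,
    "C": ["amazon", "alibaba", "ebay"],
    "D": [
        "orange: $ 8.00| pineapple: $2.00",
        "orange: $ 8.00| apple: $2.00",
        "mint: $ 8.00| watermelon: $2.00| mint: $2.00",
    ],
})

# Split D and stack the pieces into one long Series.
s = df["D"].str.split("|", expand=True).stack().dropna()

# Drop the inner level added by stack so the index matches df's row labels.
s.index = s.index.droplevel(-1)

# Name the Series; without a name, DataFrame.join raises the ValueError.
s.name = "D"

# Join on the index; each original row repeats once per piece.
joined = df.drop(columns="D").join(s)

print(joined)
```

As in the output above, the original row labels are repeated; append reset_index(drop=True) if a fresh 0..n-1 index is wanted.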
