I have the following pandas dataframe:

In:

df

out:

         A    B        C                                             D
0  0938320  usa   amazon              orange: $ 8.00| pineapple: $2.00
1  0938320  usa  alibaba                  orange: $ 8.00| apple: $2.00
2  0938320  usa     ebay  mint: $ 8.00| watermelon: $2.00| mint: $2.00
...
n  0938320  usa   amazon                  pear: $ 8.00| bannana: $2.00

I would like to split column D on | and stack the pieces into separate rows, like this (*):

         A    B        C                  D
0  0938320  usa   amazon     orange: $ 8.00
1  0938320  usa   amazon   pineapple: $2.00
2  0938320  usa  alibaba     orange: $ 8.00
3  0938320  usa  alibaba       apple: $2.00
4  0938320  usa     ebay       mint: $ 8.00
5  0938320  usa     ebay  watermelon: $2.00
6  0938320  usa     ebay        mint: $2.00
7  0938320  usa   amazon       pear: $ 8.00
...
8  0938320  usa   amazon     bannana: $2.00

So, I tried the following:

In:

s = df2.D.str.split("|").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
del df2['D']
df.join(s)

out:

ValueError: Other Series must have a name

And:

b = pd.DataFrame(df2.D.str.split('|').tolist(), index=df2['A','B','C']).stack()
b = b.reset_index()[[0, 'D']] 
b.columns = ['A','B','C']
b

However, neither attempt works. How can I modify the last approach to get (*)? I think my main problem is that I do not know how to pass all three columns as the index in index=df2['A','B','C']).stack().

2 Answers

You could first set the three columns A, B and C as the index of the DataFrame and then split the fourth column, D. Passing expand=True to str.split returns the pieces as a DataFrame, with one column per piece.

In [55]: df
Out[55]: 
        A     B         C                                              D
0  938320   usa    amazon               orange: $ 8.00| pineapple: $2.00
1  938320   usa   alibaba                   orange: $ 8.00| apple: $2.00
2  938320   usa      ebay   mint: $ 8.00| watermelon: $2.00| mint: $2.00

In [56]: df_split = df.set_index(['A', 'B', 'C'])['D'].str.split('|', expand=True)

In [57]: df_split
Out[57]: 
                                    0                   1             2
A      B    C                                                          
938320  usa  amazon    orange: $ 8.00    pineapple: $2.00          None
             alibaba   orange: $ 8.00        apple: $2.00          None
             ebay        mint: $ 8.00   watermelon: $2.00   mint: $2.00

Then stack them into a single column (stack drops NaNs by default) and move the index levels back into columns using reset_index.

In [58]: df_split.stack().reset_index(level=[0,1,2], name='D').reset_index(drop=True)
Out[58]: 
        A     B         C                   D
0  938320   usa    amazon      orange: $ 8.00
1  938320   usa    amazon    pineapple: $2.00
2  938320   usa   alibaba      orange: $ 8.00
3  938320   usa   alibaba        apple: $2.00
4  938320   usa      ebay        mint: $ 8.00
5  938320   usa      ebay   watermelon: $2.00
6  938320   usa      ebay         mint: $2.00
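
For reference, the steps above can be collected into a self-contained sketch. The helper name split_and_stack and the sample frame are assumptions made for illustration, not part of the original answer. The explicit dropna() and str.strip() calls are also additions: the first guards against pandas versions whose stack may keep missing cells, and the second removes the space left after each |.

```python
import pandas as pd

# Sample data rebuilt from the question (first three rows only).
df = pd.DataFrame({
    "A": ["0938320"] * 3,
    "B": ["usa"] * 3,
    "C": ["amazon", "alibaba", "ebay"],
    "D": [
        "orange: $ 8.00| pineapple: $2.00",
        "orange: $ 8.00| apple: $2.00",
        "mint: $ 8.00| watermelon: $2.00| mint: $2.00",
    ],
})


def split_and_stack(frame, column="D", sep="|"):
    """Hypothetical helper: one output row per `sep`-separated piece of `column`."""
    # Every other column becomes part of the index so it is repeated per piece.
    keys = [c for c in frame.columns if c != column]
    wide = frame.set_index(keys)[column].str.split(sep, expand=True)
    # Wide -> long; dropna() removes the padding cells of shorter rows.
    long = wide.stack().dropna().str.strip()
    # Move the key levels back into columns and discard the piece counter.
    return long.reset_index(level=keys, name=column).reset_index(drop=True)


result = split_and_stack(df)
print(result)
```

Deriving the key columns from the frame avoids hard-coding ['A', 'B', 'C'], which is the part the question was stuck on.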

Here is an alternative using join to combine the split data.

# split D and get it into long/stacked format
productsLong = pd.DataFrame({'products':
                df['D'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)})

# join the data together on the indices
df[['A', 'B', 'C']].join(productsLong)

Out[56]: 
        A    B        C            products
0  938320  usa   amazon      orange: $ 8.00
0  938320  usa   amazon    pineapple: $2.00
1  938320  usa  alibaba      orange: $ 8.00
1  938320  usa  alibaba        apple: $2.00
2  938320  usa     ebay        mint: $ 8.00
2  938320  usa     ebay   watermelon: $2.00
2  938320  usa     ebay         mint: $2.00
3  938320  usa   amazon        pear: $ 8.00
3  938320  usa   amazon      bannana: $2.00

Notes
The rename method was returning an error, so I cast the Series into a DataFrame in order to give the column a name. Calling reset_index with level=1 and drop=True removes the inner index level that stack adds, keeping the index of the original DataFrame (repeated as needed for the join operation).
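
A runnable sketch of the join approach is given below, assuming the sample rows from the question. It names the stacked Series with to_frame('products'), which is one way to avoid the "Other Series must have a name" error from the first attempt in the question. For comparison it also shows DataFrame.explode, which is not used in either answer and is only available in pandas 0.25 or later. The variable names products, joined and exploded are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (first three rows only).
df = pd.DataFrame({
    "A": ["0938320"] * 3,
    "B": ["usa"] * 3,
    "C": ["amazon", "alibaba", "ebay"],
    "D": [
        "orange: $ 8.00| pineapple: $2.00",
        "orange: $ 8.00| apple: $2.00",
        "mint: $ 8.00| watermelon: $2.00| mint: $2.00",
    ],
})

# Split D, stack to long form, and drop the inner index level added by
# stack() so that only the original row labels remain (with repeats).
products = (
    df["D"].str.split("|", expand=True)
    .stack()
    .dropna()
    .reset_index(level=1, drop=True)
    .str.strip()
    .to_frame("products")  # join needs a named column
)

# Join on the index: each original row is repeated once per product.
joined = df[["A", "B", "C"]].join(products)
print(joined)

# Alternative (assumes pandas >= 0.25): split into lists, then explode.
exploded = (
    df.assign(products=df["D"].str.split("|"))
    .explode("products")
    .drop(columns="D")
)
exploded["products"] = exploded["products"].str.strip()
print(exploded)
```

Both results keep the repeated original index (0, 0, 1, 1, 2, 2, 2); append reset_index(drop=True) if a fresh 0..n-1 index is preferred.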
