
I have the following dataframe 'df', from which I'd like to create a new dataframe 'new_df', but I'm having trouble building it.

   Cust-id   Sex  Country  Orders           Products
0   'Cu1'    'F'   'FR'   'ord1 + ord2'     'A+G'
1   'Cu2'    'M'   'US'   'ord3'            'C'
2   'Cu3'    'M'   'UK'   'ord4 + ord5'     'H+Z'
3   'Cu4'    'F'   'RU'   'ord6'            'K'
4   'Cu5'    'M'   'US'   'ord7'            'T'
5    NaN     'M'   'UK'   'ord#'            'K'
6   'Cu6'    'F'   'US'   'ord8+ord9+ord10' 'R+D+S'  
7   'Cu7'    'M'   'UK'   'ord11'           'A'

I'd like 'new_df' to contain one row per order, paired with its corresponding product. All other columns keep their contents. Also, if the 'Cust-id' value of a row is NaN, that entire row should be dropped (i.e. not present in the new df). This would give the following new_df:

   Cust-id   Sex  Country  Orders   Products
0   'Cu1'    'F'   'FR'   'ord1'     'A'
1   'Cu1'    'F'   'FR'   'ord2'     'G'
2   'Cu2'    'M'   'US'   'ord3'     'C'
3   'Cu3'    'M'   'UK'   'ord4'     'H'
4   'Cu3'    'M'   'UK'   'ord5'     'Z'
5   'Cu4'    'F'   'RU'   'ord6'     'K'
6   'Cu5'    'M'   'US'   'ord7'     'T'
7   'Cu6'    'F'   'US'   'ord8'     'R'  
8   'Cu6'    'F'   'US'   'ord9'     'D' 
9   'Cu6'    'F'   'US'   'ord10'    'S'   
10  'Cu7'    'M'   'UK'   'ord11'    'A'

Any help/guidance is appreciated.

1 Answer

You can use:

import pandas as pd

# remove the quote characters, split on '+', and stack into one Series
# in which each original row label repeats once per product
s1 = (df.Products.str.strip("'")
                 .str.split('+', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1))

# same for Orders, additionally stripping the spaces around '+'
s2 = (df.Orders.str.strip("'")
               .str.split('+', expand=True)
               .stack().str.strip()
               .reset_index(drop=True, level=1))

# add the quote characters back if needed
s1 = "'" + s1 + "'"
s2 = "'" + s2 + "'"
df1 = pd.DataFrame({'Orders': s2, 'Products': s1}, index=s1.index)
print(df1)
    Orders Products
0   'ord1'      'A'
0   'ord2'      'G'
1   'ord3'      'C'
2   'ord4'      'H'
2   'ord5'      'Z'
3   'ord6'      'K'
4   'ord7'      'T'
5   'ord#'      'K'
6   'ord8'      'R'
6   'ord9'      'D'
6  'ord10'      'S'
7  'ord11'      'A'
# drop the old columns, join df1, drop rows with NaN in Cust-id
print(df.drop(['Orders', 'Products'], axis=1)
        .join(df1)
        .dropna(subset=['Cust-id'])
        .reset_index(drop=True))

   Cust-id  Sex Country   Orders Products
0    'Cu1'  'F'    'FR'   'ord1'      'A'
1    'Cu1'  'F'    'FR'   'ord2'      'G'
2    'Cu2'  'M'    'US'   'ord3'      'C'
3    'Cu3'  'M'    'UK'   'ord4'      'H'
4    'Cu3'  'M'    'UK'   'ord5'      'Z'
5    'Cu4'  'F'    'RU'   'ord6'      'K'
6    'Cu5'  'M'    'US'   'ord7'      'T'
7    'Cu6'  'F'    'US'   'ord8'      'R'
8    'Cu6'  'F'    'US'   'ord9'      'D'
9    'Cu6'  'F'    'US'  'ord10'      'S'
10   'Cu7'  'M'    'UK'  'ord11'      'A'     
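
On newer pandas versions the same reshaping can also be written with DataFrame.explode, which is a different technique from the stack/join approach above. The following is only a minimal sketch: it assumes pandas 1.3 or later (needed to explode several columns at once), it rebuilds the sample data without the literal quote characters, and it assumes every row has the same number of orders and products.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question, without the literal quote characters.
df = pd.DataFrame({
    'Cust-id': ['Cu1', 'Cu2', 'Cu3', 'Cu4', 'Cu5', np.nan, 'Cu6', 'Cu7'],
    'Sex': ['F', 'M', 'M', 'F', 'M', 'M', 'F', 'M'],
    'Country': ['FR', 'US', 'UK', 'RU', 'US', 'UK', 'US', 'UK'],
    'Orders': ['ord1 + ord2', 'ord3', 'ord4 + ord5', 'ord6', 'ord7',
               'ord#', 'ord8+ord9+ord10', 'ord11'],
    'Products': ['A+G', 'C', 'H+Z', 'K', 'T', 'K', 'R+D+S', 'A'],
})

new_df = (
    df.dropna(subset=['Cust-id'])            # drop rows without a customer id
      .assign(
          # turn each cell into a list of items
          Orders=lambda d: d['Orders'].str.split('+'),
          Products=lambda d: d['Products'].str.split('+'),
      )
      # one output row per (order, product) pair; requires equal list lengths
      .explode(['Orders', 'Products'], ignore_index=True)
)

# remove the spaces that surrounded '+' in the original strings
new_df['Orders'] = new_df['Orders'].str.strip()
new_df['Products'] = new_df['Products'].str.strip()
print(new_df)
```

If a row has a different number of orders and products, explode raises a ValueError instead of silently misaligning the pairs.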

EDIT (from the comments):

Use concat to create df1:

...
...
df1 = pd.concat([s2, s1], keys=('Orders', 'Products'), axis=1)
print(df1)
    Orders Products
0   'ord1'      'A'
0   'ord2'      'G'
1   'ord3'      'C'
2   'ord4'      'H'
2   'ord5'      'Z'
3   'ord6'      'K'
4   'ord7'      'T'
5   'ord#'      'K'
6   'ord8'      'R'
6   'ord9'      'D'
6  'ord10'      'S'
7  'ord11'      'A'

print(df.drop(['Orders', 'Products'], axis=1)
        .join(df1)
        .dropna(subset=['Cust-id'])
        .reset_index(drop=True))

   Cust-id  Sex Country   Orders Products
0    'Cu1'  'F'    'FR'   'ord1'      'A'
1    'Cu1'  'F'    'FR'   'ord2'      'G'
2    'Cu2'  'M'    'US'   'ord3'      'C'
3    'Cu3'  'M'    'UK'   'ord4'      'H'
4    'Cu3'  'M'    'UK'   'ord5'      'Z'
5    'Cu4'  'F'    'RU'   'ord6'      'K'
6    'Cu5'  'M'    'US'   'ord7'      'T'
7    'Cu6'  'F'    'US'   'ord8'      'R'
8    'Cu6'  'F'    'US'   'ord9'      'D'
9    'Cu6'  'F'    'US'  'ord10'      'S'
10   'Cu7'  'M'    'UK'  'ord11'      'A'

3 Comments

Thanks for your help jezrael, much appreciated. When creating df1 a ValueError occurs ("cannot reindex from a duplicate axis"). Any idea how to fix this?
Then try df1 = pd.concat([s2, s1], keys=('Orders', 'Products'), axis=1). Sorry, untested because I am only on my phone.
Thanks a lot, but your initial code does seem to work; there was an error in my data.
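
The thread does not say what the data error was. One plausible cause of a "cannot reindex from a duplicate axis" error here is a row whose Orders and Products cells contain different numbers of items, because s1 and s2 then end up with different duplicated indexes. A minimal sketch for spotting such rows before splitting, using made-up data (the values below are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical data: row 1 lists two orders but only one product.
df = pd.DataFrame({
    'Orders': ['ord1 + ord2', 'ord3 + ord4', 'ord5'],
    'Products': ['A+G', 'C', 'K'],
})

# Number of '+'-separated items per cell (str.count takes a regex,
# so the plus sign is escaped).
n_orders = df['Orders'].str.count(r'\+') + 1
n_products = df['Products'].str.count(r'\+') + 1

# Rows whose counts differ; a NaN in either column also compares unequal.
bad_rows = df[n_orders != n_products]
print(bad_rows)
```

Rows reported here would need to be corrected or dropped before building s1 and s2.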
