Python Pandas: Create new rows in DataFrame based on two columns

Question

I have the following DataFrame 'df', from which I'd like to create a new DataFrame 'new_df'. I'm having trouble building the new df.

   Cust-id   Sex  Country  Orders           Products
0   'Cu1'    'F'   'FR'   'ord1 + ord2'     'A+G'
1   'Cu2'    'M'   'US'   'ord3'            'C'
2   'Cu3'    'M'   'UK'   'ord4 + ord5'     'H+Z'
3   'Cu4'    'F'   'RU'   'ord6'            'K'
4   'Cu5'    'M'   'US'   'ord7'            'T'
5    NaN     'M'   'UK'   'ord#'            'K'
6   'Cu6'    'F'   'US'   'ord8+ord9+ord10' 'R+D+S'  
7   'Cu7'    'M'   'UK'   'ord11'           'A'

I'd like 'new_df' to contain one row per order, paired with its corresponding product. All other columns keep their contents. Also, if 'Cust-id' is NaN in a row, that entire row should be dropped (i.e. not present in the new df). This would give the following new_df:

   Cust-id   Sex  Country  Orders   Products
0   'Cu1'    'F'   'FR'   'ord1'     'A'
1   'Cu1'    'F'   'FR'   'ord2'     'G'
2   'Cu2'    'M'   'US'   'ord3'     'C'
3   'Cu3'    'M'   'UK'   'ord4'     'H'
4   'Cu3'    'M'   'UK'   'ord5'     'Z'
5   'Cu4'    'F'   'RU'   'ord6'     'K'
6   'Cu5'    'M'   'US'   'ord7'     'T'
7   'Cu6'    'F'   'US'   'ord8'     'R'  
8   'Cu6'    'F'   'US'   'ord9'     'D' 
9   'Cu6'    'F'   'US'   'ord10'    'S'   
10  'Cu7'    'M'   'UK'   'ord11'    'A'

Any help/guidance is appreciated.

jezrael · Accepted Answer · 2016-07-25 05:57:21Z

You can use:

import pandas as pd

# remove ', split by +, stack the pieces into one Series
# (the outer parentheses are needed so the chain can span several lines)
s1 = (df.Products.str.strip("'")
                 .str.split('+', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1))

# remove ', split by +, stack the pieces into one Series, strip spaces
s2 = (df.Orders.str.strip("'")
               .str.split('+', expand=True)
               .stack().str.strip()
               .reset_index(drop=True, level=1))

# add the ' back if needed
s1 = "'" + s1  + "'"
s2 = "'" + s2  + "'"
df1 = pd.DataFrame({'Products':s1, 'Orders':s2}, index=s1.index)
print (df1)
    Orders Products
0   'ord1'      'A'
0   'ord2'      'G'
1   'ord3'      'C'
2   'ord4'      'H'
2   'ord5'      'Z'
3   'ord6'      'K'
4   'ord7'      'T'
5   'ord#'      'K'
6   'ord8'      'R'
6   'ord9'      'D'
6  'ord10'      'S'
7  'ord11'      'A'

# delete old columns, join df1, drop rows with NaN in Cust-id
print(df.drop(['Orders', 'Products'], axis=1)
        .join(df1)
        .dropna(subset=['Cust-id'])
        .reset_index(drop=True))

   Cust-id  Sex Country   Orders Products
0    'Cu1'  'F'    'FR'   'ord1'      'A'
1    'Cu1'  'F'    'FR'   'ord2'      'G'
2    'Cu2'  'M'    'US'   'ord3'      'C'
3    'Cu3'  'M'    'UK'   'ord4'      'H'
4    'Cu3'  'M'    'UK'   'ord5'      'Z'
5    'Cu4'  'F'    'RU'   'ord6'      'K'
6    'Cu5'  'M'    'US'   'ord7'      'T'
7    'Cu6'  'F'    'US'   'ord8'      'R'
8    'Cu6'  'F'    'US'   'ord9'      'D'
9    'Cu6'  'F'    'US'  'ord10'      'S'
10   'Cu7'  'M'    'UK'  'ord11'      'A'
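
For reference, the steps above can be run end to end as one script. This is only a minimal sketch: it assumes the values are stored without the literal ' characters shown in the printed frames, and the helper name split_to_rows is made up for illustration. The extra dropna() is a precaution, since newer pandas versions may keep the empty padding cells when stacking.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (assumption: no literal quotes in the values)
df = pd.DataFrame({
    "Cust-id": ["Cu1", "Cu2", "Cu3", "Cu4", "Cu5", np.nan, "Cu6", "Cu7"],
    "Sex": ["F", "M", "M", "F", "M", "M", "F", "M"],
    "Country": ["FR", "US", "UK", "RU", "US", "UK", "US", "UK"],
    "Orders": ["ord1 + ord2", "ord3", "ord4 + ord5", "ord6", "ord7", "ord#",
               "ord8+ord9+ord10", "ord11"],
    "Products": ["A+G", "C", "H+Z", "K", "T", "K", "R+D+S", "A"],
})


def split_to_rows(column: pd.Series) -> pd.Series:
    """Split each value on '+' and return one piece per row, keeping the original row label."""
    return (column.str.split("+", expand=True)
                  .stack()
                  .dropna()                         # drop padding cells from shorter rows
                  .str.strip()                      # remove spaces around the pieces
                  .reset_index(level=1, drop=True))  # keep only the original row label


orders = split_to_rows(df["Orders"])
products = split_to_rows(df["Products"])

# Both Series carry the same repeated row labels, so they line up column-wise
df1 = pd.concat([orders, products], axis=1, keys=["Orders", "Products"])

new_df = (df.drop(columns=["Orders", "Products"])
            .join(df1)                      # repeats each customer row once per order
            .dropna(subset=["Cust-id"])     # remove rows without a customer id
            .reset_index(drop=True))
print(new_df)
```

The join relies on df having a unique index, while df1 repeats each label once per order/product pair.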

EDIT by comment:

Use concat for creating df1:

...
...
# s2 holds the orders and s1 the products, so pass them in the same order as the keys
df1 = pd.concat([s2, s1], keys=('Orders', 'Products'), axis=1)
print (df1)
    Orders Products
0   'ord1'      'A'
0   'ord2'      'G'
1   'ord3'      'C'
2   'ord4'      'H'
2   'ord5'      'Z'
3   'ord6'      'K'
4   'ord7'      'T'
5   'ord#'      'K'
6   'ord8'      'R'
6   'ord9'      'D'
6  'ord10'      'S'
7  'ord11'      'A'

print(df.drop(['Orders', 'Products'], axis=1)
        .join(df1)
        .dropna(subset=['Cust-id'])
        .reset_index(drop=True))

   Cust-id  Sex Country   Orders Products
0    'Cu1'  'F'    'FR'   'ord1'      'A'
1    'Cu1'  'F'    'FR'   'ord2'      'G'
2    'Cu2'  'M'    'US'   'ord3'      'C'
3    'Cu3'  'M'    'UK'   'ord4'      'H'
4    'Cu3'  'M'    'UK'   'ord5'      'Z'
5    'Cu4'  'F'    'RU'   'ord6'      'K'
6    'Cu5'  'M'    'US'   'ord7'      'T'
7    'Cu6'  'F'    'US'   'ord8'      'R'
8    'Cu6'  'F'    'US'   'ord9'      'D'
9    'Cu6'  'F'    'US'  'ord10'      'S'
10   'Cu7'  'M'    'UK'  'ord11'      'A'
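
As a side note, and not part of the approach above: on pandas 1.3 or later, DataFrame.explode accepts a list of columns, which may give the same result with fewer steps. The sketch below assumes a shortened, hypothetical sample without the literal ' characters, and it only works when every row has the same number of orders and products.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical sample in the shape of the question's data
df = pd.DataFrame({
    "Cust-id": ["Cu1", np.nan, "Cu6"],
    "Sex": ["F", "M", "F"],
    "Country": ["FR", "UK", "US"],
    "Orders": ["ord1 + ord2", "ord#", "ord8+ord9+ord10"],
    "Products": ["A+G", "K", "R+D+S"],
})

new_df = (
    df.dropna(subset=["Cust-id"])                      # drop rows without a customer id
      .assign(
          Orders=lambda d: d["Orders"].str.split("+"),      # each cell becomes a list
          Products=lambda d: d["Products"].str.split("+"),
      )
      # explode both list columns together; needs pandas >= 1.3
      .explode(["Orders", "Products"], ignore_index=True)
      .assign(
          Orders=lambda d: d["Orders"].str.strip(),         # remove spaces around pieces
          Products=lambda d: d["Products"].str.strip(),
      )
)
print(new_df)
```

If a row has a different number of orders and products, explode raises a ValueError instead of silently misaligning the pairs.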

edited Jul 25, 2016 at 5:57

answered Jul 22, 2016 at 13:14

jezrael

3 Comments

J_Dav Over a year ago

Thanks for your help jezrael, much appreciated. When creating df1 a ValueError occurs ("cannot reindex from a duplicate axis"). Any idea how to fix this?

jezrael Over a year ago

Then try df1 = pd.concat([s2, s1], keys=('Orders', 'Products'), axis=1). Sorry, untested because I am only on my phone.

J_Dav Over a year ago

Thanks a lot, but your initial code does work; there was an error in my data.
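
The comments do not say what the data error was. One plausible cause (an assumption, not confirmed in the thread) is a row where the number of orders and the number of products differ, so the two split Series no longer share the same index. A minimal sketch for finding such rows, using hypothetical data:

```python
import pandas as pd

# Hypothetical data: the last row has two orders but only one product
df = pd.DataFrame({
    "Cust-id": ["Cu1", "Cu2", "Cu3"],
    "Orders": ["ord1 + ord2", "ord3", "ord4 + ord5"],
    "Products": ["A+G", "C", "H"],
})

# Number of pieces per row = number of '+' separators + 1
n_orders = df["Orders"].str.count(r"\+") + 1
n_products = df["Products"].str.count(r"\+") + 1

# Rows whose counts disagree cannot be paired up one-to-one
bad_rows = df[n_orders != n_products]
print(bad_rows)
```

Running a check like this before splitting makes it easier to fix or drop the inconsistent rows first.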
