2

What is the best way to create a new DataFrame from a function applied to each row of a DataFrame? The ultimate goal is to concatenate (rbind) all the resulting DataFrames.

Input:

   Name  Age
0   tom   10
1  nick   15
2  juli   14

Example:

import pandas as pd
import pdb

data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

def foo(row):
    # pdb.set_trace()
    new_df = row.to_frame(name='Values')
    new_df.loc[new_df.index == 'Name', 'New_column'] = 'Surname'
    new_df.loc[new_df.index == 'Age', 'New_column'] = '+5 months'
    return new_df

df.apply(foo, axis=1)

Output:

data = {'Values': ['tom', '10', 'nick', '15', 'juli', '14'],
        'New_column': ['Surname', '+5 months', 'Surname', '+5 months',
                       'Surname', '+5 months']}
output = pd.DataFrame(data)

  Values New_column
0    tom    Surname
1     10  +5 months
2   nick    Surname
3     15  +5 months
4   juli    Surname
5     14  +5 months

If .apply() is not the best option, I would appreciate an alternative.

For R users: I am looking for the equivalent of do.call(rbind, sapply()).

Thanks.

  • What is your expected output? Commented Nov 1, 2019 at 10:33
  • I put the Input and final Output in the question. I hope that makes it easier now. Commented Nov 1, 2019 at 10:53

4 Answers

2

Start from one improvement in your function:

def foo(row):
    new_df = row.to_frame(name='Values')
    new_df.loc['Name', 'New_column'] = 'Surname'
    new_df.loc['Age', 'New_column'] = '+5 months'
    return new_df

("new_df.index==" is not needed).

To get your output, convert the Series of DataFrames (the result of apply) into an ordinary list of DataFrames and concatenate them.

The code to do it is:

pd.concat(df.apply(foo, axis=1).tolist())
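For anyone who wants something closer to R's do.call(rbind, lapply()), here is a minimal sketch that builds the list of per-row DataFrames with iterrows instead of apply. It reuses foo from above; the names pieces and result are only illustrative, and passing ignore_index=True is my assumption about how to get the 0..5 index shown in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'])

def foo(row):
    # Turn one row (a Series) into a two-row DataFrame and label each row.
    new_df = row.to_frame(name='Values')
    new_df.loc['Name', 'New_column'] = 'Surname'
    new_df.loc['Age', 'New_column'] = '+5 months'
    return new_df

# One small DataFrame per input row (the "lapply" step) ...
pieces = [foo(row) for _, row in df.iterrows()]

# ... stacked vertically (the "rbind" step). ignore_index=True replaces the
# repeated Name/Age labels with a fresh 0..n-1 index.
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Leaving out ignore_index keeps the repeated Name/Age labels as the index, which is what the one-liner above produces.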

1

Instead of apply, which is pretty slow, we can use pandas and NumPy methods here: transpose (df.T), melt, and numpy.tile:

import numpy as np

# transpose so each original row becomes a column, then unpivot to long form
df = df.T.melt().drop(columns='variable')
# repeat the two labels once per original row
df['New_column'] = np.tile(['Surname', '+5 months'], len(df)//2)

  value New_column
0   tom    Surname
1    10  +5 months
2  nick    Surname
3    15  +5 months
4  juli    Surname
5    14  +5 months
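A self-contained variant of the same idea, as a rough sketch: it leaves the original df untouched and names the value column Values to match the question. The names labels and long_df are hypothetical, and labels is assumed to hold exactly one entry per original column, in column order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'])

# Assumption: one label per original column, in the same order as df.columns.
labels = ['Surname', '+5 months']

# Transposing turns each original row into a column, so melt emits the values
# row by row: tom, 10, nick, 15, juli, 14.
long_df = df.T.melt(value_name='Values').drop(columns='variable')

# Repeat the label pattern once per original row.
long_df['New_column'] = np.tile(labels, len(df))
print(long_df)
```

Because the pattern is tiled rather than looked up, this only works when every row contributes the same columns in the same order.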


0

Here is a different approach that uses built-in functions of pandas and NumPy.

import pandas as pd
import numpy as np

# create df
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# provide a unique id for each row
df['id'] = df.index
# unpivot the DataFrame using the unique id as reference
n = df.melt(id_vars=['id'], value_vars=['Name', 'Age'])
# add 'new_column' and fill its values with np.where
n['new_column'] = np.where(n['variable'] == 'Name', 'Surname', '+5 months')
# sort to pair name and age; a stable sort keeps Name before Age within each id
n = n.sort_values('id', kind='mergesort')
# assign row names
n.index = n['variable']
# drop unnecessary columns (assign the result, since drop returns a copy)
n = n.drop(columns=['id', 'variable'])

output:

         value new_column
variable
Name       tom    Surname
Age         10  +5 months
Name      nick    Surname
Age         15  +5 months
Name      juli    Surname
Age         14  +5 months


0

Perhaps try:

df = df.apply(foo, axis=1)
