I have a data frame in which the first n columns are identical across groups of, for instance, 2 rows, and I would like to aggregate over the remaining columns, which are of type float. Here is an example:

import pandas as pd
import numpy as np

data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]

df = pd.DataFrame(data=data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])
df

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   100   200
1     1     2   NaN  string   102   202
2     1     2   5.0     0.5  1000  2000
3     1     2   5.0     0.5  1002  2002

So in this data frame, I would like to find the average of Var5 and Var6 over every 2 rows. The intended output would be the following:

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   101   201
1     1     2   5.0     0.5  1001  2001

Is there a way to do this given that the data types within a column are not consistent? For instance, Var3 can be NaN in some rows and a float in others.

2 Answers

You can try:

dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))
df.groupby(df.index // 2).agg(dc)

Output:

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   101   201
1     1     2   5.0     0.5  1001  2001

Details:

To get the dictionary with the functions:

When a column has mixed value types, or all of its values are strings, its dtype is object. You can therefore mask on the dtypes to find the non-object columns, and with np.where assign the function mean to the columns whose dtype is not object and first to the others:

df.dtypes
#Var1      int64
#Var2      int64
#Var3    float64
#Var4     object
#Var5      int64
#Var6      int64
#dtype: object

np.where(df.dtypes!='object','mean','first')
#['mean' 'mean' 'mean' 'first' 'mean' 'mean']

dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))
dc
#{'Var1': 'mean', 'Var2': 'mean', 'Var3': 'mean', 'Var4': 'first', 'Var5': 'mean', 'Var6': 'mean'}

To group by two rows:

You can use groupby with the argument df.index // 2 to split the dataframe into groups of two rows, and then call agg with the dictionary created above:

df.index//2
#Int64Index([0, 0, 1, 1], dtype='int64')

df.groupby(df.index//2).agg(dc)
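Putting the pieces together, a minimal self-contained sketch using the example dataframe from the question might look like:

```python
import pandas as pd
import numpy as np

# Example dataframe from the question
data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])

# 'mean' for non-object (numeric) columns, 'first' for object (mixed/string) columns
dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))

# Group every two consecutive rows and aggregate per-column
result = df.groupby(df.index // 2).agg(dc)
print(result)
```

Note that the all-NaN group of Var3 simply averages to NaN, so mixed NaN/float columns are handled without any special casing.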

Pandas 1.1+ supports NaN values in groupby keys via dropna=False:

columns = df.columns[:4].tolist()
df.groupby(columns, dropna=False, sort=False).agg("mean")

                        Var5  Var6
Var1 Var2 Var3 Var4
1    2    NaN  string    101   201
          5.0  0.5      1001  2001
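With this approach the grouping keys end up in a MultiIndex. If you want the flat frame shown in the question's intended output, a reset_index() at the end restores the key columns; a sketch, assuming pandas >= 1.1 for dropna=False:

```python
import pandas as pd
import numpy as np

data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])

columns = df.columns[:4].tolist()
# dropna=False keeps groups whose keys contain NaN (pandas >= 1.1);
# sort=False preserves the original row order
flat = (df.groupby(columns, dropna=False, sort=False)
          .agg("mean")
          .reset_index())
print(flat)
```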
