I have a data frame in which the first n columns are identical across groups of, for instance, 2 rows, and I would like to aggregate over the remaining columns, which are of type float. Here is an example:

import pandas as pd
import numpy as np

data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]

df = pd.DataFrame(data=data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])
df

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   100   200
1     1     2   NaN  string   102   202
2     1     2   5.0     0.5  1000  2000
3     1     2   5.0     0.5  1002  2002

So in this data frame, I would like to find the average of Var5 and Var6 over every 2 rows. The intended output would be the following:

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   101   201
1     1     2   5.0     0.5  1001  2001

Is there a way to do this given that the data types within a column are not consistent? For instance, Var3 can be NaN in some rows and a float in others.

2 Answers

You can try:

dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))
df.groupby(df.index // 2).agg(dc)

Output:

   Var1  Var2  Var3    Var4  Var5  Var6
0     1     2   NaN  string   101   201
1     1     2   5.0     0.5  1001  2001

Details:

To get the dictionary with the functions:

When a column has mixed value types, or all of its values are strings, its dtype is object. You can therefore mask on the dtypes to find the non-object columns, and with np.where assign the function mean to the columns whose dtype is not object and first to the others:

df.dtypes
#Var1      int64
#Var2      int64
#Var3    float64
#Var4     object
#Var5      int64
#Var6      int64
#dtype: object

np.where(df.dtypes!='object','mean','first')
#['mean' 'mean' 'mean' 'first' 'mean' 'mean']

dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))
dc
#{'Var1': 'mean', 'Var2': 'mean', 'Var3': 'mean', 'Var4': 'first', 'Var5': 'mean', 'Var6': 'mean'}

To group by two rows:

You can use groupby with the argument df.index // 2 to split the dataframe into groups of two rows, and then call agg with the dictionary created above:

df.index//2
#Int64Index([0, 0, 1, 1], dtype='int64')

df.groupby(df.index//2).agg(dc)
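Putting the pieces together, a minimal self-contained sketch using the example dataframe from the question might look like:

```python
import pandas as pd
import numpy as np

# Example dataframe from the question
data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])

# 'mean' for non-object (numeric) columns, 'first' for object (mixed/string) columns
dc = dict(zip(df.columns, np.where(df.dtypes != 'object', 'mean', 'first')))

# Group every two consecutive rows and aggregate per-column
result = df.groupby(df.index // 2).agg(dc)
print(result)
```

Note that the all-NaN group of Var3 simply averages to NaN, so mixed NaN/float columns are handled without any special casing.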

Pandas 1.1+ supports NaN values in groupby keys via dropna=False:

columns = df.columns[:4].tolist()
df.groupby(columns, dropna=False, sort=False).agg("mean")

                        Var5  Var6
Var1 Var2 Var3 Var4
1    2    NaN  string    101   201
          5.0  0.5      1001  2001
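With this approach the grouping keys end up in a MultiIndex. If you want the flat frame shown in the question's intended output, a reset_index() at the end restores the key columns; a sketch, assuming pandas >= 1.1 for dropna=False:

```python
import pandas as pd
import numpy as np

data = [[1, 2, np.nan, 'string', 100, 200],
        [1, 2, np.nan, 'string', 102, 202],
        [1, 2, 5, 0.5, 1000, 2000],
        [1, 2, 5, 0.5, 1002, 2002]]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6'])

columns = df.columns[:4].tolist()
# dropna=False keeps groups whose keys contain NaN (pandas >= 1.1);
# sort=False preserves the original row order
flat = (df.groupby(columns, dropna=False, sort=False)
          .agg("mean")
          .reset_index())
print(flat)
```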
