Get the mean across multiple Pandas DataFrames

Question

I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.001182  0.184535  0.814230  0.000054
1        0.000001  0.160490  0.839508  0.000001
2        0.000001  0.173829  0.826114  0.000055
3        0.000432  0.180065  0.819502  0.000001
4        0.000152  0.157041  0.842694  0.000113
5        0.000183  0.174142  0.825674  0.000001
6        0.000001  0.151556  0.848405  0.000038
7        0.000771  0.177583  0.821645  0.000001
8        0.000001  0.202059  0.797939  0.000001
9        0.000025  0.189537  0.810410  0.000028
10       0.006142  0.003041  0.493912  0.496905
11       0.003739  0.002367  0.514216  0.479678
12       0.002334  0.001517  0.529041  0.467108
13       0.003458  0.000001  0.532265  0.464276
14       0.000405  0.005655  0.527576  0.466364
15       0.002557  0.003233  0.507954  0.486256
16       0.004161  0.000001  0.491271  0.504568
17       0.001364  0.001330  0.528311  0.468996
18       0.002886  0.000001  0.506392  0.490721
19       0.001823  0.002498  0.509620  0.486059

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.000001  0.197108  0.802495  0.000396
1        0.000001  0.157860  0.842076  0.000063
2        0.094956  0.203057  0.701662  0.000325
3        0.000001  0.181948  0.817841  0.000210
4        0.000003  0.169680  0.830316  0.000001
5        0.000362  0.177194  0.822443  0.000001
6        0.000001  0.146807  0.852924  0.000268
7        0.001087  0.178994  0.819564  0.000354
8        0.000001  0.202182  0.797333  0.000485
9        0.000348  0.181399  0.818252  0.000001
10       0.003050  0.000247  0.506777  0.489926
11       0.004420  0.000001  0.513927  0.481652
12       0.006488  0.001396  0.527197  0.464919
13       0.001510  0.000001  0.525987  0.472502
14       0.000001  0.000001  0.520737  0.479261
15       0.000001  0.001765  0.515658  0.482575
16       0.000001  0.000001  0.492550  0.507448
17       0.002855  0.000199  0.526535  0.470411
18       0.000001  0.001952  0.498303  0.499744
19       0.001232  0.000001  0.506612  0.492155

Then I want to get the mean of these two dataframes.

What is the easiest way to do this?

Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.

So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.

What mean exactly: one single value per variable, or one value per variable-cluster? — FooBar
– FooBar, Commented Jul 31, 2014 at 11:55
For the record, display your data as df.reset_index() if you want us to use your data for test cases. — FooBar
– FooBar, Commented Jul 31, 2014 at 13:06

ali_m · Accepted Answer · 2014-07-31 13:21:59Z

66

Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

import numpy as np
import pandas as pd

# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

# concatenate them
df_concat = pd.concat((df1, df2))

print df_concat.mean()
# x   -0.163044
# y    2.120000
# dtype: float64

print df_concat.median()
# x   -0.192037
# y    2.000000
# dtype: float64

Update

If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()

print df_means.head()
#           x    y
# 0 -0.850794  1.5
# 1  0.159038  1.5
# 2  0.083278  1.0
# 3 -0.540336  0.5
# 4  0.390954  3.5

This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.

edited Jul 31, 2014 at 13:21

answered Jul 31, 2014 at 11:44

ali_m

74.6k28 gold badges230 silver badges315 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Tim Over a year ago

Thanks for the response, Bot the columns and the indexes are the same but I want to get the average accross only one position in the DF if that makes sense, so in the example i gave the average for [0,Source.0] would be (0.001182 + 0.000001) / 2

FooBar Over a year ago

You should have responded this to my question in the comments to the original question also ;)

ali_m Over a year ago

@TimRich .groupby() is probably the simplest way to do this - see @FooBar's answer and my update

FooBar · Accepted Answer · 2017-09-06 09:00:02Z

22

I go similar as @ali_m, but since you want one mean per row-column combination, I conclude differently:

df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()

          x    y
0  0.841282  2.5
1  0.716749  1.0
2 -0.551903  2.5
3  1.240736  1.5
4  1.227109  2.0

edited Sep 6, 2017 at 9:00

answered Jul 31, 2014 at 13:04

FooBar

16.7k20 gold badges94 silver badges189 bronze badges

4 Comments

ZJS Over a year ago

much faster than my general function, you can also use apply on this for more general functions

Markus W Over a year ago

great solution! I just had to use Level=1 to not Group on the dataframe names. foo = df.groupby(level=1).mean()

Mead Over a year ago

This solute gives me ValueError: level > 0 or level < -1 only valid with MultiIndex with Pandas 1.3.2.

amc Over a year ago

Maybe my solution will help, @Mead

user394430 · Accepted Answer · 2016-10-24 19:32:50Z

18

As per Niklas' comment, the solution to the question is panel.mean(axis=0).

As a more complete example:

import pandas as pd
import numpy as np

dfs = {}
nrows = 4
ncols = 3
for i in range(4):
    dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
                          columns=list('abc'))
    print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))

panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))

Will give the following output:

DF0:
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

DF1:
    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

DF2:
    a   b   c
0   2   3   4
1   5   6   7
2   8   9  10
3  11  12  13

DF3:
    a   b   c
0   3   4   5
1   6   7   8
2   9  10  11
3  12  13  14

Mean of stacked DFs:
      a     b     c
0   1.5   2.5   3.5
1   4.5   5.5   6.5
2   7.5   8.5   9.5
3  10.5  11.5  12.5

answered Oct 24, 2016 at 19:32

user394430

2,9972 gold badges30 silver badges28 bronze badges

3 Comments

Tom Walker Over a year ago

Ooh I didn't know about panel. Very useful for blending/stacking dataframes.

Xema Over a year ago

Mind that the class Panel is now deprecated and will be removed in future versions of Pandas.

S.EB Over a year ago

Very useful, thank you very much for sharing your knowledge

ZJS · Accepted Answer · 2014-07-31 17:56:21Z

10

Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...

averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()

And your done...

Or for more general purposes...

dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
             .apply(lambda x:x.mean(),axis=1)\
             .unstack()

edited Jul 31, 2014 at 17:56

answered Jul 31, 2014 at 13:57

ZJS

4,0812 gold badges18 silver badges23 bronze badges

2 Comments

ali_m Over a year ago

That'll work fine for the mean, but it's harder to generalize to other summary stats such as the median, which the OP was also asking for.

ZJS Over a year ago

Didn't keep that in mind when I made the answer but I do think that this solution would absolutely be able to do that you can combine each of the series I made into a dataframe and run the median solution on the frame rowwise yeilding you the seiries which you can unstack

Phillip Cloud · Accepted Answer · 2014-07-31 12:57:05Z

You can simply assign a label to each frame, call it group and then concat and groupby to do what you want:

In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [58]: df2 = df.copy()

In [59]: dfs = [df, df2]

In [60]: df
Out[60]:
        a       b       c       d
0  0.1959  0.1260  0.1464  0.1631
1  0.9344 -1.8154  1.4529 -0.6334
2  0.0390  0.4810  1.1779 -1.1799
3  0.3542  0.3819 -2.0895  0.8877
4 -2.2898 -1.0585  0.8083 -0.2126
5  0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785  0.0885  1.0945 -1.6271
7 -1.7169  0.3760 -1.4078  0.8994
8  0.0508  0.4891  0.0274 -0.6369
9 -0.7019  1.0425 -0.5476 -0.5143

In [61]: for i, d in enumerate(dfs):
   ....:     d['group'] = i
   ....:

In [62]: dfs[0]
Out[62]:
        a       b       c       d  group
0  0.1959  0.1260  0.1464  0.1631      0
1  0.9344 -1.8154  1.4529 -0.6334      0
2  0.0390  0.4810  1.1779 -1.1799      0
3  0.3542  0.3819 -2.0895  0.8877      0
4 -2.2898 -1.0585  0.8083 -0.2126      0
5  0.3727 -0.6867 -1.3440 -1.4849      0
6 -1.1785  0.0885  1.0945 -1.6271      0
7 -1.7169  0.3760 -1.4078  0.8994      0
8  0.0508  0.4891  0.0274 -0.6369      0
9 -0.7019  1.0425 -0.5476 -0.5143      0

In [63]: final = pd.concat(dfs, ignore_index=True)

In [64]: final
Out[64]:
         a       b       c       d  group
0   0.1959  0.1260  0.1464  0.1631      0
1   0.9344 -1.8154  1.4529 -0.6334      0
2   0.0390  0.4810  1.1779 -1.1799      0
3   0.3542  0.3819 -2.0895  0.8877      0
4  -2.2898 -1.0585  0.8083 -0.2126      0
5   0.3727 -0.6867 -1.3440 -1.4849      0
6  -1.1785  0.0885  1.0945 -1.6271      0
..     ...     ...     ...     ...    ...
13  0.3542  0.3819 -2.0895  0.8877      1
14 -2.2898 -1.0585  0.8083 -0.2126      1
15  0.3727 -0.6867 -1.3440 -1.4849      1
16 -1.1785  0.0885  1.0945 -1.6271      1
17 -1.7169  0.3760 -1.4078  0.8994      1
18  0.0508  0.4891  0.0274 -0.6369      1
19 -0.7019  1.0425 -0.5476 -0.5143      1

[20 rows x 5 columns]

In [65]: final.groupby('group').mean()
Out[65]:
           a       b       c       d
group
0     -0.394 -0.0576 -0.0682 -0.4339
1     -0.394 -0.0576 -0.0682 -0.4339

Here, each group is the same, but that's only because df == df2.

Alternatively, you can throw the frames into a Panel:

In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [71]: panel = pd.Panel({0: df, 1: df2})

In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d

In [73]: panel.mean()
Out[73]:
        0       1
a  0.3839  0.2956
b  0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450

A bit late but it should be "panel.mean(axis=0)" based on what the question was asking. I just came across this post and really like the Panel idea! Thanks for that!

amc · Accepted Answer · 2022-04-11 15:47:56Z

3

With Pandas version 1.3.4 this works for me:

import numpy as np
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100), z=np.random.randint(-3, 2, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 2, 100), z=np.random.randint(-1, 2, 100)))
pd.concat([df1, df2]).groupby(level=0).mean()

answered Apr 11, 2022 at 15:47

amc

8491 gold badge17 silver badges30 bronze badges

Comments

A. John Callegari · Accepted Answer · 2023-09-30 16:33:08Z

0

For computing the mean of a list of pandas DataFrames, I find it simplest to use:
mean_df = sum(df_list)/len(df_list)

For computing median or other stats using pandas, you need to use multiindex grouping in which can be a clunky solution. I prefer to use numpy, which can accept DataFrames as inputs and has a natural mechanism for stacking and aggregating along a new axis that many will be familiar with:
medians = np.median(np.stack(df_list), axis=0)

np.stack() creates a new axis 0
np.median(axis=0) aggregates along the new axis

edited Sep 30, 2023 at 16:33

answered Sep 27, 2023 at 21:20

A. John Callegari

1491 silver badge5 bronze badges

Collectives™ on Stack Overflow

Get the mean across multiple Pandas DataFrames

7 Answers 7

Update

3 Comments

4 Comments

3 Comments

2 Comments

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

Update

3 Comments

4 Comments

3 Comments

2 Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related