I have the following df:

  import numpy as np
  import pandas as pd
  a = [] 
  for i in range(5):
      tmp_df = pd.DataFrame(np.random.random((10,4)))
      tmp_df['lvl'] = i
      a.append(tmp_df) 
  df = pd.concat(a, axis=0)

df =

          0         1         2         3  lvl
0  0.928623  0.868600  0.854186  0.129116    0
1  0.667870  0.901285  0.539412  0.883890    0
2  0.384494  0.697995  0.242959  0.725847    0
3  0.993400  0.695436  0.596957  0.142975    0
4  0.518237  0.550585  0.426362  0.766760    0
5  0.359842  0.417702  0.873988  0.217259    0
6  0.820216  0.823426  0.585223  0.553131    0
7  0.492683  0.401155  0.479228  0.506862    0
..............................................   
3  0.505096  0.426465  0.356006  0.584958    3
4  0.145472  0.558932  0.636995  0.318406    3
5  0.957969  0.068841  0.612658  0.184291    3
6  0.059908  0.298270  0.334564  0.738438    3
7  0.662056  0.074136  0.244039  0.848246    3
8  0.997610  0.043430  0.774946  0.097294    3
9  0.795873  0.977817  0.780772  0.849418    3
0  0.577173  0.430014  0.133300  0.760223    4
1  0.916126  0.623035  0.240492  0.638203    4
2  0.165028  0.626054  0.225580  0.356118    4
3  0.104375  0.137684  0.084631  0.987290    4
4  0.934663  0.835608  0.764334  0.651370    4
5  0.743265  0.072671  0.911947  0.925644    4
6  0.212196  0.587033  0.230939  0.994131    4
7  0.945275  0.238572  0.696123  0.536136    4
8  0.989021  0.073608  0.720132  0.254656    4
9  0.513966  0.666534  0.270577  0.055597    4

I am learning pandas and wondering: what is the easiest way to compute the average across the values of the lvl column?

What I mean is:

(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5

The desired output is a table of shape (10, 4), without the lvl column, where each element is the average of the 5 corresponding elements (one for each lvl in [0, 1, 2, 3, 4]). I hope that helps.

    can you provide the desired output with maybe 3 or 4 lines of sample data? Commented Mar 15, 2018 at 13:34

3 Answers

I think you need:

np.random.seed(456)
a = [] 
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df) 
df = pd.concat(a, axis=0)
#print (df)

df1 = (df[df.lvl ==0 ] + df[df.lvl ==1 ] + 
       df[df.lvl ==2 ] + df[df.lvl ==3 ] + 
       df[df.lvl ==4 ]) / 5
print (df1)
          0         1         2         3  lvl
0  0.411557  0.520560  0.578900  0.541576    2
1  0.253469  0.655714  0.532784  0.620744    2
2  0.468099  0.576198  0.400485  0.333533    2
3  0.620207  0.367649  0.531639  0.475587    2
4  0.699554  0.548005  0.683745  0.457997    2
5  0.322487  0.316137  0.489660  0.362146    2
6  0.430058  0.159712  0.631610  0.641141    2
7  0.399944  0.511944  0.346402  0.754591    2
8  0.400190  0.373925  0.340727  0.407988    2
9  0.502879  0.399614  0.321710  0.715812    2

df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
          0         1         2         3
0  0.411557  0.520560  0.578900  0.541576
1  0.253469  0.655714  0.532784  0.620744
2  0.468099  0.576198  0.400485  0.333533
3  0.620207  0.367649  0.531639  0.475587
4  0.699554  0.548005  0.683745  0.457997
5  0.322487  0.316137  0.489660  0.362146
6  0.430058  0.159712  0.631610  0.641141
7  0.399944  0.511944  0.346402  0.754591
8  0.400190  0.373925  0.340727  0.407988
9  0.502879  0.399614  0.321710  0.715812
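
To make the double groupby above easier to follow, here is a minimal sketch on a smaller frame. The names small, position, result and manual are made up for this illustration and are not part of the original answer. The inner groupby('lvl').cumcount() numbers the rows inside each lvl block, and the outer groupby then averages all rows that share the same position.

```python
import numpy as np
import pandas as pd

np.random.seed(456)

# Smaller stand-in for the question's frame: 3 levels, 4 rows each (assumed sizes).
frames = []
for i in range(3):
    tmp = pd.DataFrame(np.random.random((4, 2)))
    tmp['lvl'] = i
    frames.append(tmp)
small = pd.concat(frames, axis=0)

# cumcount numbers the rows within each 'lvl' group: 0, 1, 2, 3, 0, 1, 2, 3, ...
position = small.groupby('lvl').cumcount()
print(position.tolist())

# Grouping by that position averages the k-th row of every 'lvl' block.
result = small.drop(columns='lvl').groupby(position).mean()

# Cross-check against the explicit sum from the question.
manual = sum(small[small.lvl == i].drop(columns='lvl') for i in range(3)) / 3
print(np.allclose(result.to_numpy(), manual.to_numpy()))
```

Because cumcount works on row position rather than on index labels, this variant should still work when the blocks do not share the same index values, as long as every block has its rows in the same order.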

EDIT:

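If each subset of the DataFrame has the same index, from 0 to len(subset) - 1, you can group by the index directly (here df is the original concatenated frame, with lvl still a column):

df2 = df.groupby(level=0).mean()  # df.mean(level=0) was removed in pandas 2.0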
print (df2)
          0         1         2         3  lvl
0  0.411557  0.520560  0.578900  0.541576    2
1  0.253469  0.655714  0.532784  0.620744    2
2  0.468099  0.576198  0.400485  0.333533    2
3  0.620207  0.367649  0.531639  0.475587    2
4  0.699554  0.548005  0.683745  0.457997    2
5  0.322487  0.316137  0.489660  0.362146    2
6  0.430058  0.159712  0.631610  0.641141    2
7  0.399944  0.511944  0.346402  0.754591    2
8  0.400190  0.373925  0.340727  0.407988    2
9  0.502879  0.399614  0.321710  0.715812    2

1 Comment

Fab! I made a typo and it should be: df.groupby(df.groupby('lvl').cumcount()).mean()

The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.

df.groupby('lvl').mean()

2 Comments

Thanks, but I think I confused you. I need to compute the average 'along' lvl parameter, not within. So at the end I need to get a single matrix of size (10,4)
Ah that makes sense. Whoops.
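
The difference between averaging within each lvl and averaging along lvl is easiest to see from the result shapes. The sketch below rebuilds the question's frame; the names within and across are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(456)

# Same construction as in the question: 5 blocks of shape (10, 4), tagged with lvl.
df = pd.concat(
    [pd.DataFrame(np.random.random((10, 4))).assign(lvl=i) for i in range(5)],
    axis=0,
)

# Averaging *within* each lvl: one row per level, so shape (5, 4).
within = df.groupby('lvl').mean()
print(within.shape)

# Averaging *along* lvl: one row per index label 0..9, so shape (10, 4).
across = df.drop(columns='lvl').groupby(level=0).mean()
print(across.shape)
```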

It seems like you want to group by the index and take the average of all the columns except lvl,

i.e.

df.groupby(df.index)[[0,1,2,3]].mean()

For a dataframe generated using

np.random.seed(456)
a = [] 
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df) 
df = pd.concat(a, axis=0)

df.groupby(df.index)[[0,1,2,3]].mean()

outputs:

          0         1         2         3
0  0.411557  0.520560  0.578900  0.541576
1  0.253469  0.655714  0.532784  0.620744
2  0.468099  0.576198  0.400485  0.333533
3  0.620207  0.367649  0.531639  0.475587
4  0.699554  0.548005  0.683745  0.457997
5  0.322487  0.316137  0.489660  0.362146
6  0.430058  0.159712  0.631610  0.641141
7  0.399944  0.511944  0.346402  0.754591
8  0.400190  0.373925  0.340727  0.407988
9  0.502879  0.399614  0.321710  0.715812

which is identical to the output from

df.groupby(df.groupby('lvl').cumcount()).mean()

without resorting to a double groupby.

IMO this is cleaner to read and, for a large dataframe, should also be faster, since it does a single groupby.
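
The speed claim is easy to check on your own data. The following is a rough sketch, not a rigorous benchmark: the frame size (50 levels of 1000 rows) and the helper names single_groupby and double_groupby are assumptions made for this example, and timings will vary with the pandas version and machine.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(456)

# Larger test frame; the sizes are arbitrary choices for this sketch.
n_levels, n_rows = 50, 1000
df = pd.concat(
    [pd.DataFrame(np.random.random((n_rows, 4))).assign(lvl=i) for i in range(n_levels)],
    axis=0,
)

def single_groupby():
    # Group on the repeated index labels and average the value columns.
    return df.groupby(df.index)[[0, 1, 2, 3]].mean()

def double_groupby():
    # Number rows within each lvl block, then group on that position.
    indexed = df.set_index('lvl')
    return indexed.groupby(indexed.groupby('lvl').cumcount()).mean()

# Both approaches should agree when every block has the index 0..n_rows-1 in order.
same = np.allclose(single_groupby().to_numpy(), double_groupby().to_numpy())
print("results match:", same)

t_single = timeit.timeit(single_groupby, number=5)
t_double = timeit.timeit(double_groupby, number=5)
print(f"single groupby: {t_single:.4f}s, double groupby: {t_double:.4f}s")
```

Note that the two approaches only coincide when each block carries the same index labels in the same order. If the index has been reset or shuffled, grouping on df.index and grouping on cumcount can give different results.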
