How to find the average of multiple columns in a file using python

Question

Hi, I have a file with too many columns to open in Excel. Each column has a header row with the column's title, followed by 10 rows of numerical values from 0 to 2. I would like the output to be the name of each column and the average value of its 10 rows. The file is too large to open in Excel 2000, so I have to try using Python. Any tips on an easy way to do this?

Here is a sample of the first 3 columns:

Trial1 Trial2 Trial3

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

I want Python to write the output to a text file:

Trial1 Trial2 Trial3
1 2 1 (whatever the averages are)

Should the average be given as an integer? Can you say what you want the result to be for your example data?
– Mark Byers, Commented Sep 11, 2010 at 23:11

AndiDog · Accepted Answer · 2010-09-12 09:03:05Z

2

A memory-friendly solution without using any modules:

with open("filename", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)

    for line in f:
        # Skip empty lines
        if not line.strip():
            continue

        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1

    for index, summedRowValue in enumerate(sums):
        print columns[index], 1.0 * summedRowValue / numRows
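For readers on Python 3, the same streaming idea might look like the sketch below. It is an adaptation, not the code from the answer: the function name `column_averages` is made up for illustration, `split()` is called without an argument so that spaces and tabs both work, and values are parsed with `float()` on the assumption that some columns may hold decimals.

```python
def column_averages(lines):
    """Return (column_names, averages) from an iterable of text lines.

    Only one line is held in memory at a time, so this also suits large files.
    """
    lines = iter(lines)
    # The first line holds the column titles.
    columns = next(lines).split()
    sums = [0.0] * len(columns)
    num_rows = 0

    for line in lines:
        values = line.split()  # splits on any run of spaces or tabs
        if not values:
            continue  # skip empty lines
        for i, value in enumerate(values):
            sums[i] += float(value)
        num_rows += 1

    if num_rows == 0:
        raise ValueError("no data rows found")

    averages = [total / num_rows for total in sums]
    return columns, averages


# Small in-memory sample (invented data, with blank lines between rows).
sample = ["Trial1 Trial2 Trial3\n", "\n", "1 0 1\n", "\n", "0 2 0\n", "\n", "2 1 2\n"]
names, means = column_averages(sample)
for name, mean in zip(names, means):
    print(name, mean)
```

An open file object can be passed in place of the list, for example `with open("filename") as f: names, means = column_averages(f)`.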

edited Sep 12, 2010 at 9:03

answered Sep 11, 2010 at 23:10


4 Comments

Joe Kington Over a year ago

No need to use f.xreadlines(). "for line in f:" is exactly equivalent, and works in both Python 2.x and 3.x.

Robert A. Fettikowski Over a year ago

OK, I'm having issues with this because the values are separated by a TAB and not a space. So I made the spacing between the quotes longer to look like a tab, and it didn't work. I got this error instead:

Robert A. Fettikowski Over a year ago

Traceback (most recent call last): File "C:/avy5.py", line 13, in <module> sums[i] += int(values[i]) ValueError: invalid literal for int() with base 10: '001\t001.0037\t001.1070\t001.1000\t2\t2\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t0\t0\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t1\t1\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0'

AndiDog Over a year ago

@Robert A. Fettikowski: A tab is not 4 spaces - a tab is a single character. Tabs are usually escaped as "\t" as you can clearly see. Just change the split string to "\t".
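The difference between the separators can be seen in a short Python 3 sketch. The sample line is invented for illustration and is not taken from the file in question.

```python
line = "1\t0\t2\n"  # three values separated by single tab characters

by_space = line.split(" ")          # no spaces present, so nothing is split
by_tab = line.strip().split("\t")   # split on the tab character
by_any = line.split()               # no argument: split on any whitespace

print(by_space)
print(by_tab)
print(by_any)
```

Splitting on a string of several spaces fails for the same reason as splitting on one space: the line contains no spaces at all, only tab characters.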

ars · Accepted Answer · 2010-09-11 23:04:58Z

2

You can use Numpy:

import numpy as np
from StringIO import StringIO

s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')

data = np.loadtxt(s, skiprows=1)  # skip header row
print data.mean(axis=0)  # column means

# OUTPUT: [ 0.8  1.   0.8]

Note that the first argument to loadtxt could be the name of your file instead of a file-like object.

answered Sep 11, 2010 at 23:04


Mark Byers · Accepted Answer · 2010-09-11 23:08:12Z

1

You can use the built-in csv module:

import csv
csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]

print ' '.join(headers)
print ' '.join(str(x) for x in averages)

Result:

Trial1 Trial2 Trial3
1 1 1
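A Python 3 variant of the csv approach is sketched below, under a few assumptions: it keeps running sums rather than a list of all rows, and it writes the result to a file as the question asks. The function name `write_column_averages` and the file name `averages.txt` are hypothetical, chosen only for this example.

```python
import csv
import os
import tempfile


def write_column_averages(in_path, out_path, delimiter=" "):
    """Read a delimited file with a header row and write rounded column averages."""
    with open(in_path, newline="") as src:
        reader = csv.reader(src, delimiter=delimiter)
        headers = next(reader)
        sums = [0] * len(headers)
        count = 0
        for row in reader:
            if not row:
                continue  # csv.reader yields [] for blank lines
            for i, cell in enumerate(row):
                sums[i] += int(cell)
            count += 1

    # round() with one argument returns an int in Python 3.
    averages = [round(total / count) for total in sums]
    with open(out_path, "w") as dst:
        dst.write(delimiter.join(headers) + "\n")
        dst.write(delimiter.join(str(a) for a in averages) + "\n")
    return averages


# Demo on a small temporary file (paths are created just for this example).
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input.txt")
out_path = os.path.join(tmp_dir, "averages.txt")
with open(in_path, "w") as f:
    f.write("Trial1 Trial2 Trial3\n1 0 1\n0 0 0\n0 2 0\n2 2 2\n1 1 1\n")

result = write_column_averages(in_path, out_path)
with open(out_path) as f:
    written = f.read()
print(written)
```

Pass `delimiter="\t"` for a tab-separated file. Note that Python 3's `round()` rounds exact halves to the nearest even integer, so averages ending in .5 can differ from what Python 2 would give.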

answered Sep 11, 2010 at 23:08


2 Comments

AndiDog Over a year ago

Nice solution. But this might consume a lot of memory (the values list) if the file really is so large that Excel can't open it.

Robert A. Fettikowski Over a year ago

This didn't print the results?

ghollisjr · Accepted Answer · 2013-10-16 12:35:12Z

0

Less of an answer than it is an alternative understanding of the problem:

You could think of each line being a vector. In this way, the average done column-by-column is just the average of each of these vectors. All you need in order to do this is

A way to read a line into a vector object,
A vector addition operation,
Scalar multiplication (or division) of vectors.

Python comes (I think) with most of this already installed, but this should lead to some easily readable code.
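One way to read this idea is the plain Python 3 sketch below. The helper names `parse_vector`, `add`, and `scale` are made up for illustration, and the rows are a few lines of invented sample data; lists stand in for vector objects.

```python
def parse_vector(line):
    """Read one whitespace-separated line into a list of floats."""
    return [float(token) for token in line.split()]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


def scale(v, factor):
    """Multiply every component of a vector by a scalar."""
    return [factor * a for a in v]


rows = ["1 0 1", "0 0 0", "0 2 0", "2 2 2"]
vectors = [parse_vector(row) for row in rows]

# Sum all row vectors, then divide by the number of rows.
total = vectors[0]
for vec in vectors[1:]:
    total = add(total, vec)

mean = scale(total, 1 / len(vectors))
print(mean)
```

For a large file, the rows could be read and added one at a time instead of being collected into a list first.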

answered Oct 16, 2013 at 12:35

