2

Hi I have a file that consists of too many columns to open in excel. Each column has 10 rows of numerical values 0-2 and has a row saying the title of the column. I would like the output to be the name of the column and the average value of the 10 rows. The file is too large to open in excel 2000 so I have to try using python. Any tips on an easy way to do this.

Here is a sample of the first 3 columns:

Trial1 Trial2 Trial3

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

1 0 1

0 0 0

0 2 0

2 2 2

1 1 1

I want python to output as a test file

Trial 1 Trial 2 Trial 3 1 2 1 (whatever the averages are)

2
  • Should the average be given as an integer? Can you say what you want the result to be for your example data? Commented Sep 11, 2010 at 23:11
  • Are those empty lines really in your input file? Commented Sep 11, 2010 at 23:12

4 Answers 4

2

A memory-friendly solution without using any modules:

with open("filename", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)

    for line in f:
        # Skip empty lines
        if not line.strip():
            continue

        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1

    for index, summedRowValue in enumerate(sums):
        print columns[index], 1.0 * summedRowValue / numRows
Sign up to request clarification or add additional context in comments.

4 Comments

No need to use f.xreadlines(). for line in f: is exactly equivalent, and works in both python 2.x and 3.x.
Ok I'm having issues with this because the lines are sepearted by a TAB and not a space. So I made the spacinjg between the ""s longer to look like a tab and it didn't work. I got this error instead
raceback (most recent call last): File "C:/avy5.py", line 13, in <module> sums[i] += int(values[i]) ValueError: invalid literal for int() with base 10: '001\t001.0037\t001.1070\t001.1000\t2\t2\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t0\t0\t1\t1\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t1\t1\t1\t1\t0\t0\t1\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t1\t1\t0\t0\t0'
@Robert A. Fettikowski: A tab is not 4 spaces - a tab is a single character. Tabs are usually escaped as "\t" as you can clearly see. Just change the split string to "\t".
2

You can use Numpy:

import numpy as np
from StringIO import StringIO

s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')

data = np.loadtxt(s, skiprows=1)  # skip header row
print data.mean(axis=0)  # column means

# OUTPUT: array([ 0.8,  1. ,  0.8])

Note that the first argument to loadtxt could be the name of your file instead of a file like object.

Comments

1

You can use the builtin csv module:

import csv
csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]

print ' '.join(headers)
print ' '.join(str(x) for x in averages)

Result:

Trial1 Trial2 Trial3
1 1 1

2 Comments

Nice solution. But this might consume a lot of memory (the values list) if the file is really as large so that Excel can't open it.
This didn't print the results?
0

Less of an answer than it is an alternative understanding of the problem:

You could think of each line being a vector. In this way, the average done column-by-column is just the average of each of these vectors. All you need in order to do this is

  1. A way to read a line into a vector object,

  2. A vector addition operation,

  3. Scalar multiplication (or division) of vectors.

Python comes (I think) with most of this already installed, but this should lead to some easily readable code.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.