Python - Calculate average for every column in a csv file

Question

I'm new to Python and I'm trying to get the average of every column (or row) of a CSV file, and then select the values that are more than double the average of their column (or row). My file has hundreds of columns and contains float values like these:

845.123,452.234,653.23,...
432.123,213.452,421.532,...
743.234,532,432.423,...

I've tried several changes to my code to get the average for every column (separately), but at the moment my code is like this one:

def AverageColumn (c):
    f=open(csv,"r")
    average=0
    Sum=0
    column=len(f)
    for i in range(0,column):
        for n in i.split(','):
            n=float(n)
            Sum += n
        average = Sum / len(column)
    return 'The average is:', average

    f.close()


csv="MDT25.csv"
print AverageColumn(csv)

But I always get an error like "f has no len()" or "'int' object is not iterable"...

I'd really appreciate it if someone could show me how to get the average for every column (or row, whichever you prefer), and then select the values that are more than double the average of their column (or row). I'd rather not import modules such as csv, but do whatever you prefer. Thanks!

@AdamSmith: Modules will solve the problem. Going without will teach how to code.
– Amadan, Commented Sep 1, 2014 at 0:26
@Amadan I've never understood that point of view. If you really believed that, you wouldn't consider any interpreted language "coding," probably would demand to build your compiler yourself, or simply write machine code.
– Adam Smith, Commented Sep 1, 2014 at 0:28
@AdamSmith: Strawman argument. To learn basics of Python, you should write basics of Python. A library won't teach you not to write things after return, or how to use arrays to have several computations going at once.
– Amadan, Commented Sep 1, 2014 at 0:34

Chris Hoekstra · Accepted Answer · 2016-02-05 22:58:32Z

Here's a cleaned-up version of your function, though it probably doesn't do what you want: it computes a single average over all values in all columns:

def average_column(csv):
    f = open(csv, "r")
    total = 0.0
    value_count = 0
    for row in f:
        for column in row.split(','):
            total += float(column)
            value_count += 1
    f.close()
    # divide by the number of values read, not by the length of the last field
    average = total / value_count
    return 'The average is:', average

I would use the csv module (which makes csv parsing easier), with a Counter object to manage the column totals and a context manager to open the file (no need for a close()):

import csv
from collections import Counter

def average_column(csv_filepath):
    column_totals = Counter()
    row_count = 0
    # in Python 3 the csv module expects text mode with newline=""
    with open(csv_filepath, newline="") as f:
        reader = csv.reader(f)
        for row in reader:
            for column_idx, column_value in enumerate(row):
                try:
                    column_totals[column_idx] += float(column_value)
                except ValueError:
                    print("Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx))
            row_count += 1

    # row_count counts every row read; if the file has a header row,
    # skip it before the loop with next(reader) so it is not counted

    # make sure column index keys are in order
    column_indexes = sorted(column_totals)

    # calculate per column averages using a list comprehension
    averages = [column_totals[idx] / row_count for idx in column_indexes]
    return averages
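
The question also asks for the values that are greater than twice their column's average. A minimal sketch of that step, assuming the rows are already parsed into lists of floats and the per-column averages are already computed; `values_above_double_average` is a hypothetical helper name and the sample numbers are made up:

```python
def values_above_double_average(rows, averages):
    """Return (row_index, column_index, value) for every value that is
    greater than twice the average of its own column."""
    return [
        (r, c, value)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if value > 2 * averages[c]
    ]

# Made-up sample data: three rows, three columns
rows = [
    [1.0, 10.0, 5.0],
    [1.0, 10.0, 5.0],
    [10.0, 10.0, 5.0],
]
averages = [4.0, 10.0, 5.0]  # per-column means of the rows above

selected = values_above_double_average(rows, averages)
print(selected)
```

Returning the row and column index along with the value makes it possible to locate each selected value in the original file.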

Amadan · Accepted Answer · 2014-09-01 00:57:25Z

3

First of all, as people say, CSV format looks simple, but it can be quite nontrivial, especially once strings come into play. monkut already gave you two solutions: the cleaned-up version of your code, and one more that uses the csv library. I'll give yet another option: no libraries, but plenty of idiomatic code to chew on, which gives you averages for all columns at once.

def get_averages(csv):
    with open(csv) as file:
        lines = file.readlines()
        rows_of_numbers = [map(float, line.split(',')) for line in lines]
        sums = map(sum, zip(*rows_of_numbers))
        averages = [sum_item / len(lines) for sum_item in sums]
        return averages

Things to note: In your code, f is a file object. You try to close it after you have already returned the value. This code will never be reached: nothing executes after a return has been processed, unless you have a try...finally construct, or with construct (like I am using - which will automatically close the stream).

map(f, l), or the equivalent [f(x) for x in l], creates a new list whose elements are obtained by applying function f to each element of l. (In Python 3, map returns a lazy iterator rather than a list; wrap it in list() if you need a list.)

f(*l) will "unpack" the list l before the function is invoked, passing each element to f as a separate argument.
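
A small sketch of those two idioms on in-memory data, to show what each intermediate value looks like. The sample lines are made up for illustration and are not taken from the question's file:

```python
lines = ["1.0,2.0,3.0", "3.0,4.0,5.0"]

# map(float, ...) applies float to every field of a line;
# list() materialises the result so it can be inspected
rows = [list(map(float, line.split(","))) for line in lines]
# rows is [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

# zip(*rows) is the same call as zip(rows[0], rows[1]): it groups the
# first fields together, then the second fields, and so on (the columns)
columns = list(zip(*rows))
# columns is [(1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]

averages = [sum(column) / len(column) for column in columns]
print(averages)
```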

edited Sep 1, 2014 at 0:57

answered Sep 1, 2014 at 0:50


2 Comments

Pabloo LR Over a year ago

Thanks, I appreciate it, but I get this error: rows_of_numbers = [map(float, line.split(',')) for line in lines] ValueError: could not convert string to float:

Pabloo LR Over a year ago

Ok, I've found the mistake, it was because the last column is different
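
One way to tolerate a column like that is sketched below, under the assumption that empty or non-numeric cells should simply be skipped (the thread does not say what the last column contained). `tolerant_column_averages` is a hypothetical name, and each column is divided by its own count of numeric cells:

```python
def tolerant_column_averages(lines):
    """Average each column, ignoring cells that are not valid floats."""
    sums = {}
    counts = {}
    for line in lines:
        for idx, cell in enumerate(line.strip().split(",")):
            try:
                value = float(cell)
            except ValueError:
                continue  # empty or non-numeric cell: leave it out
            sums[idx] = sums.get(idx, 0.0) + value
            counts[idx] = counts.get(idx, 0) + 1
    # only columns with at least one numeric cell appear in the result
    return {idx: sums[idx] / counts[idx] for idx in sorted(sums)}

# Made-up sample: the last column holds text in one row and is empty in the other
sample = ["1.0,2.0,abc\n", "3.0,4.0,\n"]
result = tolerant_column_averages(sample)
print(result)
```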

Sahana M · Accepted Answer · 2019-02-27 16:06:40Z

1

This definitely worked for me!

import numpy as np
import csv

# read every row of the file into a list
with open('C:\\...\\your_file_name.csv', 'r') as f:
    data = list(csv.reader(f))

# in case you have a header/title in the first row of your csv file, do the next line, else skip it
data.pop(0)

# collect the column you want; your_column_number is its zero-based index
# (float() also accepts integer values)
q1 = [float(row[your_column_number]) for row in data]

print('Mean of your_column_number:', np.mean(q1))

edited Feb 27, 2019 at 16:06

answered Feb 27, 2019 at 15:36


2 Comments

cmprogram Over a year ago

It would be better to inform OP on WHY this worked for you.

Sahana M Over a year ago

It's simple: my csv had 5 columns, with the first row containing the column titles. I first read the csv file and stored each row in a list (data[]), then popped the first row, which contained the titles (non-numerical values), so that it would not cause an error while calculating the mean. I then built a new list (q1[]) from the particular column I wanted the mean of, and calculated the mean with NumPy's built-in mean function.

Adam Smith · Accepted Answer · 2014-09-01 00:52:57Z

0

If you want to do it without stdlib modules for some reason:

with open('path/to/csv') as infile:
    columns = list(map(float,next(infile).split(',')))
    for how_many_entries, line in enumerate(infile,start=2):
        for (idx,running_avg), new_data in zip(enumerate(columns), line.split(',')):
            columns[idx] += (float(new_data) - running_avg)/how_many_entries
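
The update in the inner loop is the incremental mean: after n values, mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n, so each column's average is maintained without storing the rows or a separate sum. A sketch of the same idea on in-memory lines, with made-up data and a hypothetical function name:

```python
def running_column_means(lines):
    """Compute per-column means one line at a time using the incremental mean."""
    lines = iter(lines)
    # the first line initialises each running mean (n = 1)
    means = [float(field) for field in next(lines).split(",")]
    for n, line in enumerate(lines, start=2):
        for idx, field in enumerate(line.split(",")):
            means[idx] += (float(field) - means[idx]) / n
    return means

sample = ["1.0,10.0", "2.0,20.0", "6.0,30.0"]
means = running_column_means(sample)
print(means)
```

Because only one line is held at a time, this approach also suits files that are too large to read into memory.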

edited Sep 1, 2014 at 0:52

answered Sep 1, 2014 at 0:47


Code-Apprentice · Accepted Answer · 2014-09-01 01:11:34Z

0

I suggest breaking this into several smaller steps:

Read the CSV file into a 2D list or 2D array.
Calculate the averages of each column.

Each of these steps can be implemented as its own function. (In a realistic situation where the CSV file is large, reading the complete file into memory might be prohibitive due to space constraints. However, for a learning exercise, this is a great way to gain an understanding of writing your own functions.)
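
A minimal sketch of that split, assuming every non-blank line holds the same number of numeric fields. The function names `read_rows` and `column_averages` are illustrative, and `io.StringIO` stands in for an opened file so the example runs without one:

```python
import io

def read_rows(csv_file):
    """Step 1: parse an open CSV file into a 2D list of floats."""
    return [
        [float(field) for field in line.split(",")]
        for line in csv_file
        if line.strip()  # ignore blank lines
    ]

def column_averages(rows):
    """Step 2: average each column of the 2D list."""
    return [sum(column) / len(column) for column in zip(*rows)]

# In real use this would be: with open("MDT25.csv") as demo_file: ...
demo_file = io.StringIO("1.0,2.0\n3.0,6.0\n")
rows = read_rows(demo_file)
averages = column_averages(rows)
print(averages)
```

Keeping the two steps separate means the parsing can later be swapped for the csv module without touching the averaging code.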

answered Sep 1, 2014 at 1:11


John Wilkins · Accepted Answer · 2018-04-07 18:02:32Z

I hope this helps you out. Here is what I would do, which is to use numpy:

    import numpy as np
    import csv

    # Assume that you have 2 columns and a header row.
    # The columns are (1) question 1; (2) question 2.

    # 'filename.csv' is the file that holds your original data
    with open('filename.csv', 'r') as f:
        data = list(csv.reader(f))

    header = data.pop(0)  # remove the header row before converting to numbers
    q1 = []
    q2 = []
    for row in data:
        q1.append(float(row[0]))  # first column (zero-based index)
        q2.append(float(row[1]))  # second column

    # === Means/Variance - Work-up Section ===
    print('Mean - Question-1:            ', np.mean(q1))
    print('Variance,Question-1:          ', np.var(q1))
    print('==============================================')
    print('Mean - Question-2:            ', np.mean(q2))
    print('Variance,Question-2:          ', np.var(q2))

RoyalBigMack · Accepted Answer · 2022-08-25 08:01:57Z

0

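import csv
from statistics import mean
values = []
with open(r'path/to/csv', 'r', newline='') as f:
    for row in csv.reader(f):
        try:
            values.append(float(row[2]))  # str.isnumeric() would reject '845.123'
        except ValueError:
            pass  # skip headers and other non-numeric cells
print(mean(values))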

Replace 2 with the zero-based index of the column you want to average.

answered Aug 25, 2022 at 8:01

