4

I'm new to Python and I'm trying to get the average of every column (or row) of a CSV file, and then select the values that are more than double the average of their column (or row). My file has hundreds of columns and contains float values like these:

845.123,452.234,653.23,...
432.123,213.452,421.532,...
743.234,532,432.423,...

I've tried several changes to my code to get the average of every column separately, and at the moment it looks like this:

def AverageColumn (c):
    f=open(csv,"r")
    average=0
    Sum=0
    column=len(f)
    for i in range(0,column):
        for n in i.split(','):
            n=float(n)
            Sum += n
        average = Sum / len(column)
    return 'The average is:', average

    f.close()


csv="MDT25.csv"
print AverageColumn(csv)

But I always get an error like "f has no len()" or "'int' object is not iterable"...

I'd really appreciate it if someone could show me how to get the average of every column (or row, whichever you prefer), and then select the values that are more than double the average of their column (or row). I'd rather do it without importing modules such as csv, but use whatever you prefer. Thanks!

4
  • Why don't you want to use stdlib modules (e.g. csv)? Commented Sep 1, 2014 at 0:26
  • 1
    @AdamSmith: Modules will solve the problem. Going without will teach how to code. Commented Sep 1, 2014 at 0:26
  • 1
    @Amadan I've never understood that point of view. If you really believed that, you wouldn't consider any interpreted language "coding," probably would demand to build your compiler yourself, or simply write machine code. Commented Sep 1, 2014 at 0:28
  • @AdamSmith: Strawman argument. To learn basics of Python, you should write basics of Python. A library won't teach you not to write things after return, or how to use arrays to have several computations going at once. Commented Sep 1, 2014 at 0:34

7 Answers

5

Here's a cleaned-up version of your function, but it probably doesn't do what you want. As written, it computes the average of all values across all columns:

def average_column(csv):
    f = open(csv, "r")
    Sum = 0.0
    value_count = 0
    for row in f:
        for column in row.split(','):
            n = float(column)
            Sum += n
            value_count += 1
    f.close()
    # divide by the number of values read; len(column) would be the
    # length of the last string, not a count of values
    average = Sum / value_count
    return 'The average is:', average

I would use the csv module (which makes csv parsing easier), with a Counter object to manage the column totals and a context manager to open the file (no need for a close()):

import csv
from collections import Counter

def average_column (csv_filepath):
    column_totals = Counter()
    with open(csv_filepath,"rb") as f:
        reader = csv.reader(f)
        row_count = 0.0
        for row in reader:
            for column_idx, column_value in enumerate(row):
                try:
                    n = float(column_value)
                    column_totals[column_idx] += n
                except ValueError:
                    print "Error -- ({}) Column({}) could not be converted to float!".format(column_value, column_idx)
            row_count += 1.0

    # row_count now equals the number of rows read; if the file starts
    # with a header row, subtract 1.0 here so the header is not counted

    # make sure column index keys are in order
    column_indexes = sorted(column_totals.keys())

    # calculate per column averages using a list comprehension
    averages = [column_totals[idx]/row_count for idx in column_indexes]
    return averages

3

First of all, as people say, the CSV format looks simple, but it can be quite nontrivial, especially once strings come into play. monkut already gave you two solutions: the cleaned-up version of your code, and one that uses the csv library. I'll give yet another option: no libraries, but plenty of idiomatic code to chew on, which gives you the averages for all columns at once.

def get_averages(csv):
    with open(csv) as f:
        lines = f.readlines()
        # convert every line into a sequence of floats
        rows_of_numbers = [map(float, line.split(',')) for line in lines]
        # zip(*rows) regroups the rows into columns, which are then summed
        sums = map(sum, zip(*rows_of_numbers))
        averages = [sum_item / len(lines) for sum_item in sums]
        return averages

Things to note: in your code, f is a file object. You try to close it after you have already returned the value. That code will never be reached: nothing executes after a return has been processed, unless you have a try...finally construct or a with construct (like the one I am using, which will automatically close the stream).
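
A minimal sketch of that point, assuming Python 3. The function names are made up for illustration, and an in-memory io.StringIO stands in for a real file so the example runs anywhere:

```python
import io

def close_after_return(stream):
    """close() sits after return, so it is never executed."""
    data = stream.read()
    return data
    stream.close()  # unreachable

def close_with_finally(stream):
    """The finally block runs even though the try block returns."""
    try:
        return stream.read()
    finally:
        stream.close()

def close_with_with(stream):
    """Leaving the with block closes the stream, even via return."""
    with stream:
        return stream.read()

leaked = io.StringIO("1.0,2.0")
close_after_return(leaked)

closed_by_finally = io.StringIO("1.0,2.0")
close_with_finally(closed_by_finally)

closed_by_with = io.StringIO("1.0,2.0")
close_with_with(closed_by_with)

print(leaked.closed, closed_by_finally.closed, closed_by_with.closed)
```

Only the first stream is still open afterwards, which is the situation in the question's function.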

map(f, l), or the equivalent [f(x) for x in l], creates a new list whose elements are obtained by applying the function f to each element of l. (That is the Python 2 behaviour; in Python 3, map returns an iterator instead of a list.)

f(*l) will "unpack" the list l before the function is invoked, passing each element to f as a separate argument.
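
To see both ideas on a small scale, here is a sketch assuming Python 3, where map is lazy and has to be wrapped in list() to inspect it. The sample rows are invented for illustration:

```python
rows = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

# map(f, l) and the list comprehension yield the same elements.
doubled_map = list(map(lambda x: x * 2, rows[0]))
doubled_comp = [x * 2 for x in rows[0]]

# zip(*rows) is the same call as zip(rows[0], rows[1]):
# unpacking hands each row to zip as a separate argument,
# and zip regroups them into columns.
columns = list(zip(*rows))

column_averages = [sum(col) / len(col) for col in columns]
print(doubled_map, doubled_comp)
print(columns)
print(column_averages)
```

This is the same row-to-column regrouping that get_averages relies on, just with the intermediate values printed.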

2 Comments

Thanks, I appreciate it, but I get this error: rows_of_numbers = [map(float, line.split(',')) for line in lines] ValueError: could not convert string to float:
Ok, I've found the mistake, it was because the last column is different
1

This definitely worked for me!

import numpy as np
import csv

readdata = csv.reader(open('C:\\...\\your_file_name.csv', 'r'))
data = []

for row in readdata:
  data.append(row)

# In case you have a header/title in the first row of your CSV file, run the next line; otherwise skip it
data.pop(0) 

q1 = []  

for i in range(len(data)):
  q1.append(int(data[i][your_column_number]))

print ('Mean of your_column_number :            ', (np.mean(q1)))

2 Comments

It would be better to inform OP on WHY this worked for you.
It's simple: my CSV had 5 columns, with the first row containing the column titles. I first read the CSV file and stored each row in a list (data), then popped the first row, which contained the titles (non-numerical values), so that it would not cause an error when calculating the mean. I then built a new list (q1) from the particular column I wanted the mean of, and calculated the mean with numpy's built-in mean function.
0

If you want to do it without stdlib modules for some reason:

with open('path/to/csv') as infile:
    columns = list(map(float,next(infile).split(',')))
    for how_many_entries, line in enumerate(infile,start=2):
        for (idx,running_avg), new_data in zip(enumerate(columns), line.split(',')):
            columns[idx] += (float(new_data) - running_avg)/how_many_entries
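
The loop above keeps a running mean per column: after the n-th value x arrives, the mean is updated as mean += (x - mean) / n, so no column totals have to be stored. A small check of that update on in-memory data, assuming Python 3; running_column_means and the sample rows are hypothetical names and values used only for illustration:

```python
def running_column_means(rows):
    """Return per-column means, updating them one row at a time."""
    rows = iter(rows)
    # The first row is the mean of a single observation.
    means = [float(x) for x in next(rows)]
    for n, row in enumerate(rows, start=2):
        for idx, value in enumerate(row):
            means[idx] += (float(value) - means[idx]) / n
    return means

sample = [[1.0, 10.0], [2.0, 20.0], [6.0, 60.0]]
means = running_column_means(sample)

# Compare against the direct sum / count computation.
direct = [sum(col) / len(col) for col in zip(*sample)]
print(means, direct)
```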


0

I suggest breaking this into several smaller steps:

  1. Read the CSV file into a 2D list or 2D array.
  2. Calculate the averages of each column.

Each of these steps can be implemented as a separate function. (In a realistic situation where the CSV file is large, reading the complete file into memory might be prohibitive due to space constraints. However, for a learning exercise, this is a great way to gain an understanding of writing your own functions.)
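
The two steps could be sketched roughly as follows, assuming Python 3 and a file with no header row and the same number of numeric cells on every line. The function names are hypothetical, and a third function is added for the second half of the question (values more than double their column average). An io.StringIO stands in for the open file:

```python
import io

def read_rows(lines):
    """Step 1: turn comma-separated text lines into a 2D list of floats."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            rows.append([float(cell) for cell in line.split(',')])
    return rows

def column_averages(rows):
    """Step 2: average each column (assumes equal-length rows)."""
    return [sum(col) / len(col) for col in zip(*rows)]

def values_above_double(rows, averages):
    """Collect (row, column, value) where value > 2 * column average."""
    return [
        (r, c, value)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if value > 2 * averages[c]
    ]

sample = io.StringIO("1.0,10.0\n1.0,10.0\n10.0,10.0\n")
rows = read_rows(sample)
averages = column_averages(rows)
outliers = values_above_double(rows, averages)
print(averages)
print(outliers)
```

With a real file, read_rows can be given the file object from a with open(...) block, since iterating over a file yields its lines.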


0

I hope this helps. Here is what I would do, which is to use numpy:

    import numpy as np
    import csv

    # Assume that the file has 2 columns and a header row.
    # The columns are (1) question 1 and (2) question 2.

    # 'filename.csv' is the file your original data is written to
    readdata = csv.reader(open('filename.csv', 'r'))
    data = []
    for row in readdata:
        data.append(row)
    Header = data[0]
    data.pop(0)  # drop the header row before converting to numbers
    q1 = []
    q2 = []

    # with 2 columns, the valid column indexes are 0 and 1
    for i in range(len(data)):
        q1.append(int(data[i][0]))
        q2.append(int(data[i][1]))

    # ========================================
    # === Means/Variance - Work-up Section ===
    # ========================================
    print ('Mean - Question-1:            ', (np.mean(q1)))
    print ('Variance,Question-1:          ', (np.var(q1)))
    print ('==============================================')
    print ('Mean - Question-2:            ', (np.mean(q2)))
    print ('Variance,Question-2:          ', (np.var(q2)))


0
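import csv
from statistics import mean
with open(r'path/to/csv','r') as f:
    reader = csv.reader(f)
    # str.isnumeric() rejects strings such as '845.123', so drop one '.' before
    # testing; this still skips negative numbers and exponent notation
    print(mean([float(i[2]) for i in reader if i[2].replace('.', '', 1).isdigit()]))

Replace 2 with the index of the column whose mean you want to calculate.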
