Average data based on specific columns - python

Question

I have a data file with multiple rows and 8 columns. I want to average column 8 over rows that have the same values in columns 1, 2 and 5. For example, my file can look like this:

564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619

I want to average the last column of the first and third rows, since their columns 1, 2 and 5 are identical.

I want the output to look like this:

564645  7371810 0   21642   1530    1   2   25.0813
564645  7371810 0   21642   8250    1   2   0.0103

My files (text files) are pretty big (~10000 lines), and the redundant rows (by the rule above) do not occur at regular intervals, so I want the code to find the redundant rows and average them.

In response to larsks' comment, here are my 4 lines of code:

import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)

##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]

@larsks fair question, with an unfortunate answer: over the last 1 hr 15 min the only thing I achieved was to sort my data based on the columns of interest, 'dataset = open(input('dataset_to_be_used, ')).readlines() data = np.loadtxt(dataset) datasort = np.lexsort((data[:,0],data[:,1],data[:,4])) datasorted = data[datasort]'. I am not proud, but this is as far as I have gotten.
– Dimitris, Commented Dec 14, 2012 at 3:33
What is np? I'm guessing numpy, but your code doesn't show an import so it's hard to be sure.
– larsks, Commented Dec 14, 2012 at 3:40

HYRY · Accepted Answer · 2012-12-14 03:49:36Z

you can use pandas to do this quickly:

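import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
data = StringIO("""564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619
""")
# name the columns explicitly; recent pandas versions label them 0..7 when header=None
df = pd.read_csv(data, sep=r"\s+", header=None, names=["X.%d" % i for i in range(1, 9)])
df.groupby(["X.1","X.2","X.5"])["X.8"].mean()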

the output is:

X.1     X.2      X.5 
564645  7371810  1530    25.0813
                 8250     0.0103
Name: X.8

if you don't need the index, you can call:

df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()

this will give the result as:

      X.1      X.2   X.5      X.8
0  564645  7371810  1530  25.0813
1  564645  7371810  8250   0.0103
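
On recent pandas versions, header=None gives integer column labels (0 to 7) rather than the X.1 ... X.8 names shown above, so the same grouping can also be keyed by position. A minimal sketch, assuming Python 3 and a current pandas; `sample` and `averaged` are illustrative names, not part of the original answer:

```python
from io import StringIO

import pandas as pd

sample = StringIO(
    "564645  7371810 0   21642   1530    1   2   30.8007\n"
    "564645  7371810 0   21642   8250    1   2   0.0103\n"
    "564645  7371810 0   21643   1530    1   2   19.3619\n"
)

# With header=None, recent pandas labels the columns 0..7
df = pd.read_csv(sample, sep=r"\s+", header=None)

# Columns 1, 2, 5 (1-based) are labels 0, 1, 4; column 8 is label 7.
# as_index=False keeps the group keys as ordinary columns in the result.
averaged = df.groupby([0, 1, 4], as_index=False)[7].mean()
print(averaged)
```

The result holds only the three key columns and the averaged column; the remaining columns are dropped by this grouping.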

answered Dec 14, 2012 at 3:49

HYRY

3 Comments

Dimitris Over a year ago

Thanks, this looks like it will work. I tried it but I get an error: when I try 'data = StringIO()' I lose the data variable... I will have to figure it out tomorrow. Thanks for the guidance.

HYRY Over a year ago

You don't need StringIO; I only used it to hold the example data. You can call pd.read_csv(filename, sep="\\s+", header=None), where filename is the path to the text data file.

Dimitris Over a year ago

Thanks HYRY, this works!! I am modifying it to fit my needs. I will post my final code, since I am pretty sure I did not write it in the most efficient way... Thanks

Dimitris · Accepted Answer · 2012-12-14 21:06:15Z

OK, based on HYRY's input I updated the code:

import os #needed system utils
import numpy as np# for array data processing
import pandas as pd #import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory) 
os.chdir( working)

##READ DATA FILE AND CONVERT IT TO A STRING
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset) 

df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)

This worked with the test data posted by HYRY, but when I use my own file, the df = ... line does not seem to work. I get output like this:

Traceback (most recent call last):
  File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in <module>
    df = pd.read_csv(data, sep="\s+", header=None)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
    f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
  File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
    f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216..........

any ideas?
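
A likely reading of the traceback: pd.read_csv was handed the file's contents as one long string, and it treats a plain string as a path, hence "File name too long". A minimal sketch of the two usual ways around this, assuming Python 3 and a recent pandas; the sample text is cut down from the traceback and the variable names are illustrative:

```python
from io import StringIO

import pandas as pd

# Text in the same shape the traceback shows: tab-separated, \r\n line endings
text = (
    "564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n"
    "564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n"
)

# A plain string is taken to be a path, so wrap file contents in StringIO
df_from_text = pd.read_csv(StringIO(text), sep=r"\s+", header=None)

# When the data already lives in a file, passing the path itself is simpler:
# df_from_file = pd.read_csv("1a.dat", sep=r"\s+", header=None)
print(df_from_text.shape)
```

Passing the path directly avoids reading the file by hand at all, which is what the earlier comment about not needing StringIO suggests.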

ded · Accepted Answer · 2012-12-14 22:55:50Z

It's not the most elegant of answers, and I have no idea how fast/efficient it is, but I believe it gets the job done based on the information you provided:

import numpy

data_file = "full_location_of_data_file"
data_dict = {}
for line in open(data_file):
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try: #valid if have already seen combination of 1,2,5
        data_dict[entry].append(float(columns[7]))  # append() returns None, so nothing to assign
    except (KeyError): #KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = numpy.mean(data_dict[entry])   
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)

I'm unclear whether you want/need columns 3, 6, or 7, so I omitted them. In particular, you do not make clear how you want to deal with different values that may exist within them. If you can elaborate on the behavior you want (i.e. default to a certain value, or use the first occurrence), I'd suggest either filling in default values or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists.
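
The dictionary-of-dictionaries idea could look roughly like the sketch below. It assumes the "first occurrence" behavior, meaning the columns that are not part of the key are copied from the first row of each group, and it formats the mean to 4 decimals to match the sample data. The function name `average_duplicates` and the `sample` list are hypothetical, and Python 3.7+ is assumed so that the dictionary keeps insertion order:

```python
def average_duplicates(lines):
    """Average column 8 over rows that share columns 1, 2 and 5.

    Assumption: all other columns are taken from the first row of each group.
    """
    groups = {}  # key -> {"first": columns of the first row, "values": [floats]}
    for line in lines:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        key = (columns[0], columns[1], columns[4])
        group = groups.setdefault(key, {"first": columns, "values": []})
        group["values"].append(float(columns[7]))

    rows = []
    for group in groups.values():
        mean = sum(group["values"]) / len(group["values"])
        # First seven columns from the first row, then the averaged column 8
        rows.append(group["first"][:7] + ["%.4f" % mean])
    return rows


sample = [
    "564645  7371810 0   21642   1530    1   2   30.8007",
    "564645  7371810 0   21642   8250    1   2   0.0103",
    "564645  7371810 0   21643   1530    1   2   19.3619",
]
for row in average_duplicates(sample):
    print("\t".join(row))
```

Using a tuple as the key avoids joining and re-splitting on "-", which would break if a column ever held a negative number.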

edited Dec 14, 2012 at 22:55

answered Dec 14, 2012 at 22:50

ded

5 Comments

Dimitris Over a year ago

Hi ded, thanks for the piece of code. I am hardly familiar with dictionaries at all, so I will have to read about them. I will give it a go and post a response / update.

ded Over a year ago

@Dimitris when I was first starting out, I thought of dictionaries as non-ordered lists where instead of accessing an item by number (i.e. list[0] returns the first item of a list), you access it by something you assign to it (i.e. Name_dictionary['me'] returns ded, Name_dictionary['you'] returns Dimitris). If you deselect the answer by HYRY as 'accepted', I think this may get more attention.

Dimitris Over a year ago

Thanks for the introduction to dictionaries, and for the suggestion on the "accepted" mark (although I will give credit to both of you when I am fully finished, since both suggestions seem to work). I still have the same problem: my text file appears not to be iterable and I get the result I posted below (even from your code). I am working on that now.

ded Over a year ago

@Dimitris I don't think you should have been able to get the same error in both cases, since neither of us used the same thing... I also see several references to "line 187, 141, 673", which seems to be much longer than you should need for either solution. Are you trying to do something else with the code beyond what you asked? Can you post exactly what your code is?

Dimitris Over a year ago

I will post my code right now. I solved the original problem (see my posted code). The output I want is a text file with the original columns, without the redundant rows, and with the average values in column 8 (7 if you start counting from 0 as in Python).

Community · Accepted Answer · 2020-06-20 09:12:55Z

import os #needed system utils
import numpy as np# for array data processing


datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)

##HERE I WAS TRYING TO READ THE FILE, AND THEN USE THE NAME OF THE STRING IN THE FOLLOWING LINE - THAT RESULTED IN THE SAME ERROR DESCRIBED BELOW (ERROR # 42 (I think) - too large name)

data_dict = {} #Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'): ##above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try: #valid if have already seen combination of 1,2,5
        data_dict[entry].append(float(columns[7]))  # append() returns None, so nothing to assign
    except (KeyError): #KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = np.mean(data_dict[entry])   
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)

My other problem now is getting the output in string format (or any format); after that I believe I can get to the save step and adjust the final format.

np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t') #Save the data
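
Note that `sorted` in the np.savetxt line above is Python's built-in function, not an array of results, so that call has nothing to save. One way to get from the input file to a saved text file with all 8 columns is sketched below with pandas. It assumes Python 3, a recent pandas, and that columns 3, 4, 6 and 7 should keep the value from the first row of each group; the function name `average_and_save` is hypothetical:

```python
import pandas as pd


def average_and_save(in_path, out_path):
    """Average column 8 per (column 1, 2, 5) group and write all 8 columns."""
    # header=None labels the columns 0..7 on recent pandas versions
    df = pd.read_csv(in_path, sep=r"\s+", header=None)

    # Assumption: non-key columns keep the value from the first row of a group
    how = {2: "first", 3: "first", 5: "first", 6: "first", 7: "mean"}

    # sort=False keeps the groups in order of first appearance in the file
    result = df.groupby([0, 1, 4], sort=False, as_index=False).agg(how)

    # Restore the original column order before writing
    result = result[list(range(8))]
    result.to_csv(out_path, sep="\t", header=False, index=False,
                  float_format="%.4f")
    return result


# Example call, using the file names that appear in the post:
# average_and_save("1a.dat", "sorted_data.dat")
```

The 4-decimal `float_format` is chosen only to match the precision of the sample data; drop it to write the full floating-point value.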

I still have to figure out how to add the other columns; I am working on that too.