I have a data file with multiple rows and 8 columns. I want to average column 8 over the rows that have the same values in columns 1, 2 and 5. For example, my file can look like this:

564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619

I want to average the last column of the first and third rows, since their columns 1, 2 and 5 are identical.

I want the output to look like this:

564645  7371810 0   21642   1530    1   2   25.0813
564645  7371810 0   21642   8250    1   2   0.0103

My files (text files) are pretty big (~10000 lines), and the redundant rows (based on the above rule) do not occur at regular intervals, so I want the code to find the redundant rows and average them.

In response to larsks' comment, here are my 4 lines of code:

import os
import numpy as np
datadirectory = input('path to the data directory, ')
os.chdir( datadirectory)

##READ DATA FILE AND CREATE AN ARRAY
dataset = open(input('dataset_to_be_used, ')).readlines()
data = np.loadtxt(dataset)
##Sort the data based on common X, Y and frequency
datasort = np.lexsort((data[:,0],data[:,1],data[:,4]))
datasorted = data[datasort]
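
One possible way to go from the loaded array to the averaged rows with NumPy alone is to give every row a group id and average per group. This is only a sketch: `average_duplicates` is a hypothetical helper name, and taking the remaining columns from the first row of each group is an assumption, since the question does not say what should happen when they differ.

```python
import numpy as np

def average_duplicates(data, key_cols=(0, 1, 4), value_col=7):
    """Average value_col over rows that share the same key_cols.

    The other columns are copied from the first row of each group
    (an assumption, not something the question specifies).
    """
    data = np.asarray(data, dtype=float)
    keys = data[:, list(key_cols)]
    # first: index of the first row of each distinct key
    # inverse: group id of every input row
    _, first, inverse = np.unique(
        keys, axis=0, return_index=True, return_inverse=True
    )
    inverse = inverse.ravel()  # some NumPy versions return a 2-D inverse here
    sums = np.bincount(inverse, weights=data[:, value_col])
    counts = np.bincount(inverse)
    result = data[first].copy()
    result[:, value_col] = sums / counts
    return result
```

The rows come back ordered by the key columns, so the explicit `np.lexsort` step is not needed for this approach.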
  • Can you show us what you've tried so far? Commented Dec 14, 2012 at 3:23
  • @larsks fair question, with an unfortunate answer - over the last 1 hr 15' the only thing I achieved was to sort my data, based on the columns of interest 'dataset = open(input('dataset_to_be_used, ')).readlines() data = np.loadtxt(dataset) datasort = np.lexsort((data[:,0],data[:,1],data[:,4])) datasorted = data[datasort]' I am not proud but this is as far as I have gone...... Commented Dec 14, 2012 at 3:33
  • What is np? I'm guessing numpy, but your code doesn't show an import so it's hard to be sure. Commented Dec 14, 2012 at 3:40

4 Answers

you can use pandas to do this quickly:

import pandas as pd
from StringIO import StringIO
data = StringIO("""564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619
""")
df = pd.read_csv(data, sep="\\s+", header=None)
df.groupby(["X.1","X.2","X.5"])["X.8"].mean()

the output is:

X.1     X.2      X.5 
564645  7371810  1530    25.0813
                 8250     0.0103
Name: X.8

if you don't need index, you can call:

df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()

this will give the result as:

      X.1      X.2   X.5      X.8
0  564645  7371810  1530  25.0813
1  564645  7371810  8250   0.0103
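
The `X.1` ... `X.8` labels above appear to be what the pandas release of that time generated for `header=None`. More recent pandas releases label unnamed columns with integers starting at 0, and on Python 3 `StringIO` lives in the `io` module. A sketch of the same grouping under those assumptions:

```python
import io

import pandas as pd

text = """564645  7371810 0   21642   1530    1   2   30.8007
564645  7371810 0   21642   8250    1   2   0.0103
564645  7371810 0   21643   1530    1   2   19.3619
"""
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)
# Columns are labelled 0..7, so columns 1, 2, 5 and 8 are labels 0, 1, 4 and 7.
averaged = df.groupby([0, 1, 4])[7].mean().reset_index()
print(averaged)
```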

3 Comments

Thanks, this seems like it will work. I tried it but I get an error: when I try 'data = StringIO()' I am losing the data variable... I will have to figure it out tomorrow. Thanks for the guidance.
You don't need StringIO, I use StringIO for the example data, you can call pd.read_csv(filename, sep="\\s+", header=None), where filename is the path to the text data file.
Thanks Hury, this works!! I am modifying it to fit my needs. I will post my final code, since I am pretty sure I do not write it in the most efficient way... Thanks
Ok, based on Hury's input I updated the code -

import os #needed system utils
import numpy as np# for array data processing
import pandas as pd #import the pandas module
datadirectory = input('path to the data directory, ')
working = os.environ.get("WORKING_DIRECTORY", datadirectory) 
os.chdir( working)

##READ DATA FILE AND CONVERT IT TO A STRING
dataset = open(input('dataset_to_be_used, ')).readlines()
data = ''.join(dataset) 

df = pd.read_csv(data, sep="\\s+", header=None)
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)

This worked with the test data posted by Hury, but when I use my own file the df = ... line does not seem to work. I get output like:

Traceback (most recent call last):
  File "/media/DATA/arxeia/Programming/MyPys/data_refine_average.py", line 31, in <module>
    df = pd.read_csv(data, sep="\s+", header=None)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 141, in _read
    f = com._get_handle(filepath_or_buffer, 'r', encoding=encoding)
  File "/usr/lib64/python2.7/site-packages/pandas/core/common.py", line 673, in _get_handle
    f = open(path, mode)
IOError: [Errno 36] File name too long: '564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\r\n564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\r\n564645\t7371810\t0\t21642\t20370\t1\t2\t0.0042\r\n564645\t7371810\t0\t21642\t33030\t1\t2\t0.0026\r\n564645\t7371810\t0\t21642\t47970\t1\t2\t0.0018\r\n564645\t7371810\t0\t21642\t63090\t1\t2\t0.0013\r\n564645\t7371810\t0\t21642\t93090\t1\t2\t0.0009\r\n564645\t7371810\t0\t216..........

any ideas?
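
The traceback suggests that `pd.read_csv` took the string in `data` to be a file name: it expects a path or a file-like object, not the file's contents. A minimal sketch of the two usual ways around that, written for Python 3 and a recent pandas; the temporary file only stands in for the real data file, and the two sample rows are taken from the question.

```python
import io
import os
import tempfile

import pandas as pd

contents = (
    "564645\t7371810\t0\t21642\t1530\t1\t2\t30.8007\n"
    "564645\t7371810\t0\t21642\t8250\t1\t2\t0.0103\n"
)

# Option 1: the text is already in memory, so wrap it in a file-like object.
df_from_text = pd.read_csv(io.StringIO(contents), sep=r"\s+", header=None)

# Option 2: hand read_csv the path and let it open the file itself.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as handle:
    handle.write(contents)
    path = handle.name
try:
    df_from_path = pd.read_csv(path, sep=r"\s+", header=None)
finally:
    os.remove(path)  # clean up the stand-in file
```

With the second option the `open(...).readlines()` and `''.join(...)` steps are not needed at all.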

It's not the most elegant of answers, and I have no idea how fast/efficient it is, but I believe it gets the job done based on the information you provided:

import numpy

data_file = "full_location_of_data_file"
data_dict = {}  # maps "col1-col2-col5" to the list of column-8 values seen so far
with open(data_file) as handle:
    for line in handle:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        entry = "-".join([columns[0], columns[1], columns[4]])
        try:  # works if this combination of columns 1, 2, 5 has been seen before
            data_dict[entry].append(float(columns[7]))
        except KeyError:  # raised the first time a combination of columns 1, 2, 5 appears
            data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = numpy.mean(data_dict[entry])
    output = entry.split("-")
    output.append(str(value))
    print("\t".join(output))

I'm unclear whether you want/need columns 3, 4, 6, or 7, so I omitted them. In particular, you do not make clear how you want to deal with different values that may exist within them. If you can elaborate on what behavior you want (i.e. default to a certain value, or to the first occurrence), I'd suggest either filling in default values or storing the first instance in a dictionary of dictionaries rather than a dictionary of lists.
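
If the first occurrence is an acceptable choice for the columns that are not part of the key, one possible variant keeps the first full row for each key next to the running list of column-8 values. This is only a sketch under that assumption: `average_rows` is a hypothetical helper, the four-decimal output format is a guess based on the sample data, and the key is a tuple rather than a joined string so that values containing "-" (negative numbers, for instance) cannot be split incorrectly.

```python
from collections import OrderedDict

def average_rows(lines):
    """Return tab-separated rows with column 8 averaged per (col1, col2, col5)."""
    groups = OrderedDict()  # key -> (columns of the first row, list of column-8 values)
    for line in lines:
        columns = line.split()
        if not columns:  # skip blank lines
            continue
        key = (columns[0], columns[1], columns[4])
        if key not in groups:
            groups[key] = (columns, [])
        groups[key][1].append(float(columns[7]))

    output = []
    for first_row, values in groups.values():
        row = list(first_row)
        row[7] = "%.4f" % (sum(values) / len(values))
        output.append("\t".join(row))
    return output
```

Called as `average_rows(open(data_file))`, it returns the lines in the order in which each key first appears in the file.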

5 Comments

Hi ded, thanks for the piece of code. I am hardly familiar with dictionaries at all, so I will have to read about them. I will give it a go and post a response / update.
@Dimitris when I was first starting out, I thought of dictionaries as non-ordered lists where, instead of accessing an item by number (i.e. list[0] returns the first item of a list), you access it by something you assign to it (i.e. Name_dictionary['me'] returns ded, Name_dictionary['you'] returns Dimitris). If you deselect the answer by HYRY as 'accepted' I think this may get more attention.
Thanks for the introduction to dictionaries, and for the suggestion on the "accepted" term (although I will give credit to both of you when I am fully finished, since both suggestions seem to work). I still have the same problem: my text file appears not to be iterable and I get the result I posted below (even from your code). I am working on that now.
@Dimitris I don't think you should have been able to get the same error in both cases, since neither of us used the same thing... I also see several references to "line 187, 141, 673", which seems to be much longer than you should need for either solution. Are you trying to do something else with the code beyond what you asked? Can you post exactly what your code is?
I will post my code right now. I solved the original problem (see my posted code). The output I want is a text file with the original columns, without the redundant rows, and with the average values in column 8 (7 if you start counting from 0 as in Python).
import os #needed system utils
import numpy as np# for array data processing


datadirectory = '/media/DATA/arxeia/Dimitris/Testing/12_11'
working = os.environ.get("WORKING_DIRECTORY", datadirectory)
os.chdir( working)

##HERE I WAS TRYING TO READ THE FILE, AND THEN USE THE NAME OF THE STRING IN THE FOLLOWING LINE - THAT RESULTED IN THE SAME ERROR DESCRIBED BELOW (ERROR # 42 (I think) - too large name)

data_dict = {} #Create empty dictionary
for line in open('/media/DATA/arxeia/Dimitris/Testing/12_11/1a.dat'): ##above error resolved when used this
    line = line.rstrip()
    columns = line.split()
    entry = [columns[0], columns[1], columns[4]]
    entry = "-".join(entry)
    try: #valid if have already seen combination of 1,2,5
        data_dict[entry].append(float(columns[7]))
    except (KeyError): #KeyError the first time you see a combination of columns 1,2,5
        data_dict[entry] = [float(columns[7])]

for entry in data_dict:
    value = np.mean(data_dict[entry])   
    output = entry.split("-")
    output.append(str(value))
    output = "\t".join(output)
    print(output)

My other problem now is getting the output in string format (or any format). Then I believe I can get to the save part and manipulate the final format:

np.savetxt('sorted_data.dat', sorted, fmt='%s', delimiter='\t') #Save the data

I still have to figure out how to add the other columns. I am working on that too.
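
For the two remaining steps (keeping the other columns and saving the result), one possible route is pandas' `groupby(...).agg(...)` followed by `to_csv`. This is a sketch, not a definitive solution: it assumes Python 3 and a recent pandas (unnamed columns labelled 0..7), that the first occurrence should supply the columns that are neither key nor value, and that four decimals are enough in the output. `average_file` and its arguments are placeholder names.

```python
import pandas as pd

def average_file(in_path, out_path):
    """Average column 8 per (col1, col2, col5) and write a tab-separated file."""
    df = pd.read_csv(in_path, sep=r"\s+", header=None)
    keys = [0, 1, 4]  # columns 1, 2 and 5, counting from 0
    # "first" for every non-key column, then override column 8 (label 7) with "mean"
    how = {col: "first" for col in df.columns if col not in keys}
    how[7] = "mean"
    result = df.groupby(keys, sort=False).agg(how).reset_index()
    result = result[sorted(result.columns)]  # restore the original column order
    result.to_csv(out_path, sep="\t", header=False, index=False,
                  float_format="%.4f")
    return result
```

A call such as `average_file('1a.dat', 'sorted_data.dat')` would then replace both the printing loop and the `np.savetxt` line.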
