As a result of a simulation, I have a bunch of CSV files delimited by spaces. See the example below:

Time  Node  Type  Metric 1  Metric 2
0.00   1    Abcd  1234.5678 9012.3456
0.00   1    Efgh  1234.5678 9012.3456
0.01   2    Abcd  1234.5678 9012.3456
0.01   2    Efgh  1234.5678 9012.3456
0.02   3    Abcd  1234.5678 9012.3456
0.02   3    Efgh  1234.5678 9012.3456
0.03   1    Abcd  1234.5678 9012.3456
0.03   1    Efgh  1234.5678 9012.3456
0.04   2    Abcd  1234.5678 9012.3456
0.04   2    Efgh  1234.5678 9012.3456
...

To use the metrics I need to filter the file by node number and type, i.e. the mean for node 1, type Abcd; the mean for node 1, type Efgh; and so on.

I know NumPy is very useful for handling arrays, but a plain NumPy array only holds one data type. My current code looks like this (it just prints the file's contents for now):

import sys

filename = sys.argv[1]
# read file
with open(filename, 'r') as f:
    for line in f:
        print line

# TO DO
# Slice file into different 'Node' number

# Slice subfile into different 'Type'

# Calculate metrics (mean, max, min, and others)
# which is fine once I have the sliced arrays

# Plot graphs

Does anybody know how to do this in an efficient way?

PS: I am using Python 2.7.

Thanks

3 Answers

You probably want to use pandas instead of NumPy. Assuming you have a tab-delimited file, the code is as simple as this:

import pandas as pd
data = pd.read_csv("abc.csv", delimiter="\t")
result = data.groupby("Node").mean()

which yields the following result:

       Time   Metric 1   Metric 2
Node
1     0.015  1234.5678  9012.3456
2     0.025  1234.5678  9012.3456
3     0.020  1234.5678  9012.3456

2 Comments

@ThiagoTeixeira You're welcome. And yes, pandas will do it if you group by multiple columns like this: data.groupby(["Node", "Type"]).mean()
I will try that. Thanks!
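
Since the question's files are whitespace-delimited rather than tab-delimited, and the header "Metric 1" itself contains a space, here is a sketch of the same idea with explicit column names (abc.csv is a placeholder file name):

import pandas as pd

# Column names supplied by hand, because the header row
# ("Metric 1", "Metric 2") contains spaces.
names = ["Time", "Node", "Type", "Metric_1", "Metric_2"]

# sep=r"\s+" splits on any run of whitespace; skiprows=1 skips the header
data = pd.read_csv("abc.csv", sep=r"\s+", skiprows=1, header=None, names=names)

# group by both columns, as suggested in the comments above
result = data.groupby(["Node", "Type"]).mean()
print(result)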

If I put your sample in a file, I can load it into a structured numpy array with

In [45]: names=['Time','Node','Type','Metric_1','Metric_2']
In [46]: data = np.genfromtxt('stack38285208.txt', dtype=None, names=names, skip_header=1)
In [47]: data
Out[47]: 
array([(0.0, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.0, 1, b'Efgh', 1234.5678, 9012.3456),
       (0.01, 2, b'Abcd', 1234.5678, 9012.3456),
       (0.01, 2, b'Efgh', 1234.5678, 9012.3456),
       (0.02, 3, b'Abcd', 1234.5678, 9012.3456),
       (0.02, 3, b'Efgh', 1234.5678, 9012.3456),
       (0.03, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.03, 1, b'Efgh', 1234.5678, 9012.3456),
       (0.04, 2, b'Abcd', 1234.5678, 9012.3456),
       (0.04, 2, b'Efgh', 1234.5678, 9012.3456)], 
      dtype=[('Time', '<f8'), ('Node', '<i4'), ('Type', 'S4'), ('Metric_1', '<f8'), ('Metric_2', '<f8')])

I could not use names=True because you have column names like Metric 1, which it would interpret as two column names; hence the separate names list and skip_header. I'm using Python 3, so strings with the S4 format are shown as b'Efgh'.

I can access fields (columns) by field name, and do various sorts of filter and math with those. For example:

Rows where Type is b'Abcd':

In [63]: data['Type']==b'Abcd'
Out[63]: array([ True, False,  True, False,  True, False,  True, False,  True, False], dtype=bool)

and where Node is 1:

In [64]: data['Node']==1
Out[64]: array([ True,  True, False, False, False, False,  True,  True, False, False], dtype=bool)

and together:

In [65]: (data['Node']==1)&(data['Type']==b'Abcd')
Out[65]: array([ True, False, False, False, False, False,  True, False, False, False], dtype=bool)
In [66]: ind=(data['Node']==1)&(data['Type']==b'Abcd')
In [67]: data[ind]
Out[67]: 
array([(0.0, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.03, 1, b'Abcd', 1234.5678, 9012.3456)], 
      dtype=[('Time', '<f8'), ('Node', '<i4'), ('Type', 'S4'), ('Metric_1', '<f8'), ('Metric_2', '<f8')])

I can take the mean of any of the numeric fields from this subset of records:

In [68]: data[ind]['Metric_1'].mean()
Out[68]: 1234.5678
In [69]: data[ind]['Metric_2'].mean()
Out[69]: 9012.3456000000006
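
The same masking generalizes to every combination the question asks for. A sketch that loops over the unique node and type values (plain numpy, reusing the data array from above):

import numpy as np

# assumes `data` is the structured array loaded with genfromtxt above
for node in np.unique(data['Node']):
    for typ in np.unique(data['Type']):
        mask = (data['Node'] == node) & (data['Type'] == typ)
        if mask.any():
            print(node, typ,
                  data[mask]['Metric_1'].mean(),
                  data[mask]['Metric_2'].mean())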

I could also assign these fields to variables and work with those directly:

In [70]: nodes=data['Node']
In [71]: types=data['Type']
In [72]: nodes
Out[72]: array([1, 1, 2, 2, 3, 3, 1, 1, 2, 2])
In [73]: types
Out[73]: 
array([b'Abcd', b'Efgh', b'Abcd', b'Efgh', b'Abcd', b'Efgh', b'Abcd',
       b'Efgh', b'Abcd', b'Efgh'], 
      dtype='|S4')

The two float fields, viewed as a two-column array:

In [78]: metrics = data[['Metric_1','Metric_2']].view(('float',(2)))
In [79]: metrics
Out[79]: 
array([[ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456]])

Metrics for the rows where Node is 1:

In [83]: metrics[nodes==1,:]
Out[83]: 
array([[ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456]])
In [84]: metrics[nodes==1,:].mean(axis=0)    # column mean
Out[84]: array([ 1234.5678,  9012.3456])

numpy doesn't have a neat groupby function, though Pandas and itertools do.
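
That said, a group-by over (Node, Type) can be emulated in plain numpy. A sketch using np.unique and np.bincount (again assuming the data array from above):

import numpy as np

# map each Node and Type value to a small integer, then combine
# them into a single group id per row
nodes_u, node_inv = np.unique(data['Node'], return_inverse=True)
types_u, type_inv = np.unique(data['Type'], return_inverse=True)
group = node_inv * len(types_u) + type_inv

# per-group sums and counts give per-group means in one pass
sums = np.bincount(group, weights=data['Metric_1'])
counts = np.bincount(group)
for g in np.flatnonzero(counts):
    print(nodes_u[g // len(types_u)], types_u[g % len(types_u)],
          sums[g] / counts[g])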

My attempt uses itertools. Basically it takes advantage of the groupby function, which groups consecutive items together by a key function. If you sort the dataset before calling groupby, you can group it by any key.

Not sure how large your dataset is, but if it's not too large this should do the trick.

from itertools import groupby
import sys

filename = sys.argv[1]

def parse_data(line):
    # converts a single line of the file to a list of values;
    # split() with no argument collapses runs of whitespace and
    # drops the trailing newline
    return line.split()


with open(filename, 'r') as f:
    keys = f.readline().split()

    dataset = [parse_data(line) for line in f]

    # group dataset by node (column 1)
    dataset_grouped_by_node = groupby(
        sorted(dataset, key=lambda x: x[1]), lambda x: x[1]
    )

    for node, node_group in dataset_grouped_by_node:
        # group each node's rows by type (column 2)
        group_sorted_by_type = groupby(
            sorted(node_group, key=lambda x: x[2]), lambda x: x[2]
        )

        for type_, type_group in group_sorted_by_type:
            print type_, node

            for item in type_group:
                print item

                # calculate statistics on these subgroups

You could clean it up a bit to make a generalized "grouping" function if you wanted; a sketch of that idea follows. I think this should get you what you need.
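
For instance, a minimal generalized helper might look like this (a sketch in Python 2.7 to match the code above; grouped and the column indices are illustrative, not part of the original answer):

from itertools import groupby
from operator import itemgetter

def grouped(rows, index):
    # sort by the chosen column, then yield (key, rows sharing that key)
    key = itemgetter(index)
    for k, g in groupby(sorted(rows, key=key), key=key):
        yield k, list(g)

# usage: column 1 is Node, column 2 is Type
for node, node_rows in grouped(dataset, 1):
    for type_, rows in grouped(node_rows, 2):
        values = [float(r[3]) for r in rows]  # Metric 1
        print node, type_, sum(values) / len(values)

Each level of grouping then becomes a single call instead of a sort-plus-groupby pair.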
