As a result of a simulation, I have a bunch of CSV files delimited by spaces. See the example below:

Time  Node  Type  Metric 1  Metric 2
0.00   1    Abcd  1234.5678 9012.3456
0.00   1    Efgh  1234.5678 9012.3456
0.01   2    Abcd  1234.5678 9012.3456
0.01   2    Efgh  1234.5678 9012.3456
0.02   3    Abcd  1234.5678 9012.3456
0.02   3    Efgh  1234.5678 9012.3456
0.03   1    Abcd  1234.5678 9012.3456
0.03   1    Efgh  1234.5678 9012.3456
0.04   2    Abcd  1234.5678 9012.3456
0.04   2    Efgh  1234.5678 9012.3456
...

To use the metrics I need to filter the file by node number and type, i.e. the mean for node 1, type Abcd; the mean for node 1, type Efgh; and so on.

I know NumPy is very useful for handling arrays, but a plain NumPy array only holds one data type. My current code looks like this (it just prints the file's contents for now):

import sys

filename = sys.argv[1]
# read file
with open(filename, 'r') as f:
    for line in f:
        print line

# TO DO
# Slice file into different 'Node' number

# Slice subfile into different 'Type'

# Calculate metrics (mean, max, min, and others)
# which is fine once I have the sliced arrays

# Plot graphs

Does anybody know how to do this in an efficient way?

PS: I am using Python 2.7.

Thanks

3 Answers

You probably want to use pandas instead of NumPy. Assuming you have a tab-delimited file, the code is as simple as this:

import pandas as pd
data = pd.read_csv("abc.csv", delimiter="\t")
result = data.groupby("Node").mean()

which yields the following result:

       Time   Metric 1   Metric 2
Node
1     0.015  1234.5678  9012.3456
2     0.025  1234.5678  9012.3456
3     0.020  1234.5678  9012.3456

2 Comments

@ThiagoTeixeira You're welcome. And yes, pandas will do it if you group by multiple columns like this: data.groupby(["Node", "Type"]).mean()
I will try that. Thanks!
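
Since the question's files are whitespace-delimited rather than tab-delimited, and the header "Metric 1" itself contains a space, here is a sketch of the same idea with explicit column names (abc.csv is a placeholder file name):

import pandas as pd

# Column names supplied by hand, because the header row
# ("Metric 1", "Metric 2") contains spaces.
names = ["Time", "Node", "Type", "Metric_1", "Metric_2"]

# sep=r"\s+" splits on any run of whitespace; skiprows=1 skips the header
data = pd.read_csv("abc.csv", sep=r"\s+", skiprows=1, header=None, names=names)

# group by both columns, as suggested in the comments above
result = data.groupby(["Node", "Type"]).mean()
print(result)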

If I put your sample in a file, I can load it into a structured numpy array with

In [45]: names=['Time','Node','Type','Metric_1','Metric_2']
In [46]: data = np.genfromtxt('stack38285208.txt', dtype=None, names=names, skip_header=1)
In [47]: data
Out[47]: 
array([(0.0, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.0, 1, b'Efgh', 1234.5678, 9012.3456),
       (0.01, 2, b'Abcd', 1234.5678, 9012.3456),
       (0.01, 2, b'Efgh', 1234.5678, 9012.3456),
       (0.02, 3, b'Abcd', 1234.5678, 9012.3456),
       (0.02, 3, b'Efgh', 1234.5678, 9012.3456),
       (0.03, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.03, 1, b'Efgh', 1234.5678, 9012.3456),
       (0.04, 2, b'Abcd', 1234.5678, 9012.3456),
       (0.04, 2, b'Efgh', 1234.5678, 9012.3456)], 
      dtype=[('Time', '<f8'), ('Node', '<i4'), ('Type', 'S4'), ('Metric_1', '<f8'), ('Metric_2', '<f8')])

I could not use names=True because you have column names like Metric 1, which it would interpret as two column names; hence the separate names list and skip_header. I'm using Python 3, so strings with the S4 format are shown as b'Efgh'.

I can access fields (columns) by field name, and do various sorts of filter and math with those. For example:

Rows where Type is b'Abcd':

In [63]: data['Type']==b'Abcd'
Out[63]: array([ True, False,  True, False,  True, False,  True, False,  True, False], dtype=bool)

and where Node is 1:

In [64]: data['Node']==1
Out[64]: array([ True,  True, False, False, False, False,  True,  True, False, False], dtype=bool)

and together:

In [65]: (data['Node']==1)&(data['Type']==b'Abcd')
Out[65]: array([ True, False, False, False, False, False,  True, False, False, False], dtype=bool)
In [66]: ind=(data['Node']==1)&(data['Type']==b'Abcd')
In [67]: data[ind]
Out[67]: 
array([(0.0, 1, b'Abcd', 1234.5678, 9012.3456),
       (0.03, 1, b'Abcd', 1234.5678, 9012.3456)], 
      dtype=[('Time', '<f8'), ('Node', '<i4'), ('Type', 'S4'), ('Metric_1', '<f8'), ('Metric_2', '<f8')])

I can take the mean of any of the numeric fields from this subset of records:

In [68]: data[ind]['Metric_1'].mean()
Out[68]: 1234.5678
In [69]: data[ind]['Metric_2'].mean()
Out[69]: 9012.3456000000006
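
The same masking generalizes to every combination the question asks for. A sketch that loops over the unique node and type values (plain numpy, reusing the data array from above):

import numpy as np

# assumes `data` is the structured array loaded with genfromtxt above
for node in np.unique(data['Node']):
    for typ in np.unique(data['Type']):
        mask = (data['Node'] == node) & (data['Type'] == typ)
        if mask.any():
            print(node, typ,
                  data[mask]['Metric_1'].mean(),
                  data[mask]['Metric_2'].mean())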

I could also assign these fields to variables and work with those directly:

In [70]: nodes=data['Node']
In [71]: types=data['Type']
In [72]: nodes
Out[72]: array([1, 1, 2, 2, 3, 3, 1, 1, 2, 2])
In [73]: types
Out[73]: 
array([b'Abcd', b'Efgh', b'Abcd', b'Efgh', b'Abcd', b'Efgh', b'Abcd',
       b'Efgh', b'Abcd', b'Efgh'], 
      dtype='|S4')

The two float fields, viewed as a two-column array:

In [78]: metrics = data[['Metric_1','Metric_2']].view(('float',(2)))
In [79]: metrics
Out[79]: 
array([[ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456]])

Metrics for the rows where Node is 1:

In [83]: metrics[nodes==1,:]
Out[83]: 
array([[ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456],
       [ 1234.5678,  9012.3456]])
In [84]: metrics[nodes==1,:].mean(axis=0)    # column mean
Out[84]: array([ 1234.5678,  9012.3456])

numpy doesn't have a neat groupby function, though Pandas and itertools do.
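
That said, a group-by over (Node, Type) can be emulated in plain numpy. A sketch using np.unique and np.bincount (again assuming the data array from above):

import numpy as np

# map each Node and Type value to a small integer, then combine
# them into a single group id per row
nodes_u, node_inv = np.unique(data['Node'], return_inverse=True)
types_u, type_inv = np.unique(data['Type'], return_inverse=True)
group = node_inv * len(types_u) + type_inv

# per-group sums and counts give per-group means in one pass
sums = np.bincount(group, weights=data['Metric_1'])
counts = np.bincount(group)
for g in np.flatnonzero(counts):
    print(nodes_u[g // len(types_u)], types_u[g % len(types_u)],
          sums[g] / counts[g])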

My attempt uses itertools. Basically it takes advantage of the groupby function, which groups consecutive items together by a key function. If you sort the dataset before calling groupby, you can group it by any key.

Not sure how large your dataset is, but if it's not too large this should do the trick.

from itertools import groupby
import sys

filename = sys.argv[1]

def parse_data(line):
    # converts a single line of the file to a list of values;
    # split() with no argument collapses runs of whitespace and
    # drops the trailing newline
    return line.split()


with open(filename, 'r') as f:
    keys = f.readline().split()

    dataset = [parse_data(line) for line in f]

    # group dataset by node (column 1)
    dataset_grouped_by_node = groupby(
        sorted(dataset, key=lambda x: x[1]), lambda x: x[1]
    )

    for node, node_group in dataset_grouped_by_node:
        # group each node's rows by type (column 2)
        group_sorted_by_type = groupby(
            sorted(node_group, key=lambda x: x[2]), lambda x: x[2]
        )

        for type_, type_group in group_sorted_by_type:
            print type_, node

            for item in type_group:
                print item

                # calculate statistics on these subgroups

You could clean it up a bit to make a generalized "grouping" function if you wanted; a sketch of that idea follows. I think this should get you what you need.
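
For instance, a minimal generalized helper might look like this (a sketch in Python 2.7 to match the code above; grouped and the column indices are illustrative, not part of the original answer):

from itertools import groupby
from operator import itemgetter

def grouped(rows, index):
    # sort by the chosen column, then yield (key, rows sharing that key)
    key = itemgetter(index)
    for k, g in groupby(sorted(rows, key=key), key=key):
        yield k, list(g)

# usage: column 1 is Node, column 2 is Type
for node, node_rows in grouped(dataset, 1):
    for type_, rows in grouped(node_rows, 2):
        values = [float(r[3]) for r in rows]  # Metric 1
        print node, type_, sum(values) / len(values)

Each level of grouping then becomes a single call instead of a sort-plus-groupby pair.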
