Numpy operations on ndarrays containing strings and numbers

Question

This is my very first question on Stack Overflow. So far all my questions had already been asked, but even after much research I couldn't find an answer to this one. So here goes:

I would like to do mathematical operations on numpy arrays that have a mixed (structured) dtype. This would be trivial in R but is complicated in Python.

import numpy as np
from StringIO import StringIO
test = "a,1,2\nb,3,4"
data = np.genfromtxt(StringIO(test), delimiter=",", dtype=None)

This gives me:

print data
#array([('a', 1, 2), ('b', 3, 4)],
#      dtype=[('f0', '|S1'), ('f1', '<i8'), ('f2', '<i8')])

But then if I try to perform any mathematical operation on the numerical subset of data I get error messages:

subData = data[['f1','f2']]
print subData
# [(1, 2) (3, 4)]
subData+1
#TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'int'

or even:

subData + subData
#TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'

The only solution I came up with is not a very elegant or practical one, because I tend to lose the column names and types as well as the original shape:

subData.view(int) + 1

Thanks a lot in advance.

For what it's worth, numpy's structured arrays aren't really meant for this sort of thing. They're arrays of C-like structs, not "spreadsheet-like" data. The typical way to handle it is to hold each column in a separate array. pandas is a much better choice for this, though. It's meant for "spreadsheet-like" data.
– Joe Kington, Commented Feb 9, 2014 at 16:35

Joe Kington · Accepted Answer · 2014-02-09 16:51:53Z


Just to elaborate on my comment, structured arrays aren't exactly meant for this. They're arrays of C-like structs. They can be used to hold columns of different types, but it will become cumbersome quickly. They're very useful for certain things, but "spreadsheet-like" data is not one of them. Typically, you'd just store each column as its own array when they have different types. (This is essentially what pandas does.)

This is because structured arrays aren't arrays whose columns have different types; they're one-dimensional arrays in which each item is a record (a C-like struct) whose fields have different types.
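To make that distinction concrete, here is a minimal sketch. It builds the array by hand instead of using genfromtxt, so the exact dtype (`U1` for the string field) is an assumption rather than what the question's code produces. The array has one dimension, each item is a record, and arithmetic works once a single numeric field is selected:

```python
import numpy as np

# Hand-built stand-in for the question's array; field names mimic
# genfromtxt's defaults, and the "U1" string type is an assumption.
data = np.array(
    [("a", 1, 2), ("b", 3, 4)],
    dtype=[("f0", "U1"), ("f1", "i8"), ("f2", "i8")],
)

# One dimension: two records, not a 2 x 3 table.
print(data.shape)
print(data.dtype.names)

# Each item is a single record with named fields.
record = data[0]
print(record["f1"])

# Selecting one field gives an ordinary integer array, so arithmetic works.
f1_plus_one = data["f1"] + 1
print(f1_plus_one)
```

Selecting several fields at once (as in `data[['f1','f2']]`) still yields a structured array, which is why the addition in the question fails.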

If you did want to convert all but the first column into a "normal" 2D array, you'd do something like this:

# One row per record, one column per numeric field
numeric_data = np.column_stack([data[col] for col in data.dtype.names[1:]])
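As a self-contained sketch of that conversion, the example below stacks the numeric fields with `np.column_stack`, does the arithmetic on the resulting plain 2D array, and then writes the values back into a copy of the structured array so the field names and the string column are kept. The hand-built `data` array and the names `numeric_names`, `shifted` and `result` are illustrative assumptions, not part of the original answer:

```python
import numpy as np

# Hand-built stand-in for the question's array (dtype is an assumption).
data = np.array(
    [("a", 1, 2), ("b", 3, 4)],
    dtype=[("f0", "U1"), ("f1", "i8"), ("f2", "i8")],
)

# All fields except the first (string) one.
numeric_names = data.dtype.names[1:]

# Stack the 1D field arrays as columns of a regular 2D integer array.
numeric_data = np.column_stack([data[name] for name in numeric_names])

# Ordinary arithmetic now works.
shifted = numeric_data + 1
print(shifted)

# Optionally write the results back, keeping names and the string column.
result = data.copy()
for i, name in enumerate(numeric_names):
    result[name] = shifted[:, i]
print(result)
```

This only makes sense when the numeric fields share a compatible type; with mixed integer and float fields, the stacked array is upcast to a common dtype.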

However, for data where each column is a different type, it's far better to use pandas. It's meant for spreadsheet-like data.

from StringIO import StringIO
import pandas as pd

test = "a,1,2\nb,3,4"
data = pd.read_csv(StringIO(test), header=None)

print data[[1,2]] + 5


1 Comment

Diogo Over a year ago

Thanks Joe. I was actually rather trying to avoid having to resort to Pandas. So I prefer the first option :) I still find it weird that I can't do something so simple easily. What are structured arrays meant for if not this?
