
I have an ndarray that I am trying to read from a CSV file. I can read the file with numpy, but I can't get the structure I want: instead of a 2D array I get a 1D array of tuples.

As an MCVE: instead of a 2D array like DataSet1, I get DataSet2:

import numpy

dataset = numpy.array([
        ["abc ", 3000.0, 1],
        ["def", 3650.0, 1],
        ["xyz", 3000.0, 2]
        ])
print("DataSet1\n", dataset)
print("DataSet1-Shape\n", dataset.shape)


dataset2 = numpy.array([])

dataset2 = numpy.genfromtxt('file.csv', delimiter=",", dtype='S32,float,int')

print("DataSet2\n",dataset2)
print("DataSet2-Shape\n",dataset2.shape)

The output is:

DataSet1
 [['abc ' '3000.0' '1']
 ['def' '3650.0' '1']
 ['xyz' '3000.0' '2']]
DataSet1-Shape
 (3, 3)
DataSet2
 [(b'"fabc"', 3000.0, 1) (b'"fdef"', 3650.0, 1) (b'"ghi"', 3000.0, 2)]
DataSet2-Shape
 (3,)

I want DataSet2 to be a 2D array like DataSet1.

CSV file contents:

"fabc",3000.0,1
"fdef",3650.0,1
"ghi",3000.0,2
  • Could you include the content of your CSV? Commented Oct 5, 2016 at 11:10
  • For now the CSV is 3 lines, but it will grow: "fabc",3000.0,1 "fdef",3650.0,1 "ghi",3000.0,2 Commented Oct 5, 2016 at 11:11
  • Please edit your question to include this (I guess there are \n characters missing too). Commented Oct 5, 2016 at 11:12
  • So you are happy that Dataset1 is just strings? You can load the CSV directly like that. Try dtype=str. Commented Oct 5, 2016 at 14:30

2 Answers


Using a list comprehension and casting tuples to lists with np.array([list(tup) for tup in dataset2]) should work:

>>> np.array([list(tup) for tup in dataset2])
array([['"fabc"', '3000.0', '1'],
       ['"fdef"', '3650.0', '1'],
       ['"ghi"', '3000.0', '2']], 
      dtype='|S6')
>>> np.array([list(tup) for tup in dataset2]).shape
(3, 3)

Also notice that your dataset2 = numpy.array([]) is useless, because dataset2 is overwritten on the next line. Edit: [list(tup) for tup in dataset2] is equivalent to list(map(list, dataset2)).

For mixed types in NumPy arrays, see Store different datatypes in one NumPy array?; I suggest using a pandas.DataFrame instead.
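To illustrate the pandas suggestion: read_csv keeps a separate dtype per column and strips the double quotes by default. A minimal sketch, using an in-memory copy of the question's CSV in place of the real file:

```python
import io
import pandas as pd

# Inline stand-in for the question's file.csv; replace io.StringIO(...) with the filename.
csv_text = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# header=None because the file has no header row; quotes are stripped automatically.
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(df.dtypes)      # object, float64, int64 -- each column keeps its own type
print(df.iloc[0, 0])  # 'fabc' -- quotes already removed
```

Unlike a plain ndarray, the DataFrame preserves the numeric columns as numbers, so no later astype is needed.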


3 Comments

  • Almost works... except each field value is now a string: [[b'"fabc"' b'3000.0' b'1'] [b'"fdef"' b'3650.0' b'1'] [b'"ghi"' b'3000.0' b'2']]
  • numpy arrays can have only one type, I think. You can use a pandas.DataFrame if you want mixed types (just do df=pd.DataFrame(your_array)).
  • dataset2.tolist() works just as well as your list comprehension. np.array treats the tuples just like lists - unless given a compound dtype.

Your compound dtype loaded the file as a 1d structured array with 3 fields:

In [195]: data=np.genfromtxt('stack39872346.txt',delimiter=',',dtype='S32,float,int')
In [196]: data
Out[196]: 
array([(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1),
       (b'"ghi"', 3000.0, 2)], 
      dtype=[('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])
In [197]: data.shape
Out[197]: (3,)
In [198]: data.dtype
Out[198]: dtype([('f0', 'S32'), ('f1', '<f8'), ('f2', '<i4')])

Your Dataset1 is 2d with string dtype:

In [207]: Dataset1
Out[207]: 
array([['abc ', '3000.0', '1'],
       ['def', '3650.0', '1'],
       ['xyz', '3000.0', '2']], 
      dtype='<U6')

Converting a compound dtype to a simple one is a little tricky. It can be done with astype. But perhaps it is simpler to use the list version of data as the intermediary.

In [203]: data.tolist()
Out[203]: [(b'"fabc"', 3000.0, 1), (b'"fdef"', 3650.0, 1), (b'"ghi"', 3000.0, 2)]
In [204]: np.array(data.tolist())
Out[204]: 
array([[b'"fabc"', b'3000.0', b'1'],
       [b'"fdef"', b'3650.0', b'1'],
       [b'"ghi"', b'3000.0', b'2']], 
      dtype='|S6')

np.array has read the list of tuples and created a 2d array with a common dtype, S6 (Python 3 bytestrings).

Now it is easy to convert to unicode string with astype:

In [205]: np.array(data.tolist()).astype("U6")
Out[205]: 
array([['"fabc"', '3000.0', '1'],
       ['"fdef"', '3650.0', '1'],
       ['"ghi"', '3000.0', '2']], 
      dtype='<U6')

This is similar to Dataset1, except that the first column is double quoted.

I could skip the last astype by specifying the dtype up front: np.array(data.tolist(), dtype=str)

Better yet, tell genfromtxt to do that:

np.genfromtxt('stack39872346.txt',delimiter=',',dtype=str)
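Putting that call together as a runnable sketch (with an in-memory stand-in for the file, since genfromtxt accepts any file-like object):

```python
import io
import numpy as np

# Inline stand-in for the question's CSV file.
csv_text = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# A plain string dtype yields a regular 2d array instead of a 1d structured one.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=str)

print(arr.shape)  # (3, 3)
print(arr[0, 0])  # '"fabc"' -- genfromtxt keeps the double quotes
```

Note the quotes are still attached to the first column, which the rest of this answer addresses.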

A nice thing about the original compound dtype is that you can access the numeric fields as numbers:

In [214]: data['f1']
Out[214]: array([ 3000.,  3650.,  3000.])
In [215]: Dataset1[:,1]
Out[215]: 
array(['3000.0', '3650.0', '3000.0'], 
      dtype='<U6')

I haven't addressed the double quotes. The csv reader can strip those; genfromtxt does not. Fortunately you don't have delimiters within the quotes, so I can write a converter that strips them off during the genfromtxt read.
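For comparison, here is a sketch of the csv-reader route mentioned above: the stdlib csv module applies the quoting rules itself, so the rows arrive dequoted and np.array can stack them directly.

```python
import csv
import io
import numpy as np

# Inline stand-in for the question's CSV file.
csv_text = '"fabc",3000.0,1\n"fdef",3650.0,1\n"ghi",3000.0,2\n'

# csv.reader honors quotechar='"' by default, so the quotes are stripped here.
rows = list(csv.reader(io.StringIO(csv_text)))

arr = np.array(rows)  # all-string 2d array, like Dataset1
print(arr[0, 0])      # 'fabc' -- no quotes

# Numeric columns can still be recovered on demand:
prices = arr[:, 1].astype(float)
```

The trade-off versus the converter below is that everything becomes a string first, with numeric columns cast back only when needed.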


def foo(astr):
    return astr[1:-1] # crude dequote

In [223]: data=np.genfromtxt('stack39872346.txt',delimiter=',',
     dtype='U6,float,int', converters={0:foo})
In [224]: data
Out[224]: 
array([('fabc', 3000.0, 1), 
       ('fdef', 3650.0, 1), 
       ('ghi', 3000.0, 2)], 
      dtype=[('f0', '<U6'), ('f1', '<f8'), ('f2', '<i4')])

In [225]: np.array(data.tolist())
Out[225]: 
array([['fabc', '3000.0', '1'],
       ['fdef', '3650.0', '1'],
       ['ghi', '3000.0', '2']], 
      dtype='<U6')

It looks like I have to use a compound dtype when loading with a converter.

