I have a pandas dataframe that has a column that contains tuples made up of two floats e.g. (1.1,2.2). I want to be able to produce an array that contains the first element of each tuple. I could step through each row and get the first element of each tuple but the dataframe contains almost 4 million records and such an approach is very slow. An answer by satoru on SO (stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples) suggests using the following mechanism:

>>> import numpy as np
>>> arr = np.array([(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8)])
>>> arr
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8]])
>>> arr[:,0]
array([ 1.1,  3.3,  5.5,  7.7])

So that works fine and would be absolutely perfect for my needs. However, the problem I have occurs when I try to create a numpy array from a pandas dataframe. In that case, the above solution fails with a variety of errors. For example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
>>> df
   other       point
0      0  (1.1, 2.2)
1      0  (3.3, 4.4)
2      0  (5.5, 6.6)
3      1  (7.7, 8.8)
4      1  (9.9, 0.0)
>>> arr2 = np.array(df['point'])
>>> arr2
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
>>> arr2[:,0]
IndexError: too many indices for array

Alternatively:

>>> arr2 = np.array([df['point']])
>>> arr2
array([[[1.1, 2.2],
        [3.3, 4.4],
        [5.5, 6.6],
        [7.7, 8.8],
        [9.9, 0.0]]], dtype=object)
>>> arr2[:,0]
array([[1.1, 2.2]], dtype=object)   # Which is not what I want!

Something seems to be going wrong when I transfer data from the pandas dataframe to a numpy array - but I've no idea what. Any suggestions would be gratefully received.

2 Answers

Starting with your dataframe, I can extract a (5,2) array with:

In [68]: df=pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})

In [69]: np.array(df['point'].tolist())
Out[69]: 
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8],
       [ 9.9,  0. ]])

df['point'] is a Pandas series.

df['point'].values returns an array of shape (5,) and dtype object:

array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)

It is, in effect, an array of tuples: real Python tuples, not the tuple lookalikes of a structured array. The array actually contains pointers to the tuples, which live elsewhere in memory. Its shape is (5,), so it is a 1d array, and trying to index it as though it were 2d gives the 'too many indices' error. np.array([df['point']]) just wraps it in another dimension without addressing the fundamental object dtype issue.

tolist() converts it to a list of tuples, from which you can build the 2d array.
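As a minimal sketch of that conversion applied to the original goal (the first element of every tuple), assuming the df from the question; the names points and first are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'other': [0, 0, 0, 1, 1],
    'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)],
})

# Build a real 2d float array from the list of tuples.
points = np.array(df['point'].tolist())   # shape (5, 2), dtype float64

# Ordinary 2d indexing now works: column 0 holds the first elements.
first = points[:, 0]
```

Once points is a float array, arr[:,0]-style slicing behaves exactly as in the question's first example.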

Converting an object array into a numeric n-d array is not trivial and invariably requires some sort of copying. The data buffers are entirely different (the object array holds pointers, the numeric array holds the float values themselves), so things like astype don't work.
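To illustrate the point about astype, here is a small sketch on a shortened copy of the question's data. The exact exception type and message may differ between NumPy versions, so the example only records whether the conversion failed:

```python
import numpy as np
import pandas as pd

s = pd.Series([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6)])
obj = s.values                      # shape (3,), dtype object

try:
    obj.astype(float)               # each element is a tuple, not a number
    converted = True
except (ValueError, TypeError):
    converted = False

# Rebuilding from a list allocates a new float buffer and copies the values in.
flat = np.array(obj.tolist())       # shape (3, 2), dtype float64
```

Here converted ends up False, while flat is a regular 2d float array that can be sliced with flat[:, 0].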

1 Comment

Very clear and concise explanation - thanks very much.
0
import numpy as np
import pandas as pd
df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
array = df['point'].apply(lambda x: x[0]).values
array
# array([ 1.1,  3.3,  5.5,  7.7,  9.9])
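If only the first element is needed, one possible alternative to a per-row lambda is to iterate over the column once with np.fromiter. This is a sketch rather than a benchmark: whether it is actually faster than apply on around 4 million rows is an assumption worth timing on your own data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'other': [0, 0, 0, 1, 1],
    'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)],
})

# Take element 0 of each tuple and write it straight into a float array.
# Passing count lets NumPy allocate the result once, up front.
first = np.fromiter((p[0] for p in df['point']), dtype=float, count=len(df))
```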

1 Comment

Thanks for that solution. That would certainly produce the desired output. However, it doesn't really address the question as to why importing data from a dataframe into a numpy array doesn't work.
