4

I have a DataFrame with columns X, Y and value, e.g.:

   X |   Y | value
------------------
   1 |   1 |    56
   2 |   1 |    13
   3 |   1 |    25
 ... | ... |   ...
   1 |   2 |     7
   2 |   2 |    18
 ... | ... |   ...
   1 | 123 |    91
 ... | ... |   ...
  50 | 123 |    32

I need to convert this to DataFrame to a NumPy matrix:

[[56, 13, 25, ...],
 [ 7, 18,     ...],
 ...,
 [ 91, ...   , 32]]

I know I can iterate over each cell of the DataFrame, but that is too slow. What is the effective way of doing this?

Also note: values for some coordinates in DataFrame are missing

3
  • Did you try something along the lines of df.value.values.reshape(-1,ncols)? Commented Aug 11, 2017 at 17:36
  • numpyMatrix = df.as_matrix() Commented Aug 11, 2017 at 17:37
  • @Divakar Not working, I'm getting {ValueError}total size of new array must be unchanged, probably because the DataFrame contains missing values. Commented Aug 11, 2017 at 17:40

3 Answers 3

11

Pivot the data frame and the values should be what you need:

df.pivot('Y', 'X', 'value').values

#array([[ 56.,  13.,  25.,  nan],
#       [  7.,  18.,  nan,  nan],
#       [ 91.,  nan,  nan,  32.]])
Sign up to request clarification or add additional context in comments.

Comments

4

Using set_index

In [501]: df.set_index(['Y', 'X']).unstack().values
Out[501]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, Using groupby

In [493]: df.groupby(['Y', 'X'])['value'].sum().unstack().values
Out[493]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, Using crosstab

In [500]: pd.crosstab(index=df.Y, columns=df.X, values=df.value, aggfunc='sum').values
Out[500]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, using pd.pivot_table as pointed in another answer.

Comments

4

I would do this by going through a sparse coordinate matrix, which is basically the format you have.

NB, missing spots will be stored as 0s if you convert to an array.

If you have a ton missing, it might be better to stick to a sparse matrix for memory or performance reasons depending on your downstream processes.

x = pd.DataFrame({'X':[1,2,3,1,2,1,4], 'Y':[1,1,1,2,2,3,3], 'Z':[56,13,25,7,18,91,32]})

#import coo from sparse
from scipy.sparse import coo_matrix
#it works like (data,(y,x))
out = coo_matrix((x.Z,(x.Y-1,x.X-1))) #-1, as you aren't 0 indexed above
#if you really don't want sparse turn it to an array:
out.toarray()
array([[56, 13, 25,  0],
       [ 7, 18,  0,  0],
       [91,  0,  0, 32]], dtype=int64)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.