Pandas DataFrame with X,Y coordinates to NumPy matrix

Question

I have a DataFrame with columns X, Y and value, e.g.:

   X |   Y | value
------------------
   1 |   1 |    56
   2 |   1 |    13
   3 |   1 |    25
 ... | ... |   ...
   1 |   2 |     7
   2 |   2 |    18
 ... | ... |   ...
   1 | 123 |    91
 ... | ... |   ...
  50 | 123 |    32

I need to convert this DataFrame to a NumPy matrix:

[[56, 13, 25, ...],
 [ 7, 18,     ...],
 ...,
 [ 91, ...   , 32]]

I know I can iterate over each cell of the DataFrame, but that is too slow. What is an efficient way of doing this?

Also note: values for some coordinates in the DataFrame are missing.

Did you try something along the lines of df.value.values.reshape(-1, ncols)?
– Divakar, Commented Aug 11, 2017 at 17:36
@Divakar Not working, I'm getting ValueError: total size of new array must be unchanged, probably because the DataFrame contains missing values.
– Peter, Commented Aug 11, 2017 at 17:40

akuiper · Accepted Answer · 2017-08-11 17:40:53Z

11

Pivot the data frame and the values should be what you need:

df.pivot(index='Y', columns='X', values='value').values

#array([[ 56.,  13.,  25.,  nan],
#       [  7.,  18.,  nan,  nan],
#       [ 91.,  nan,  nan,  32.]])
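
For reference, here is a minimal self-contained sketch of the pivot approach. The sample values are made up to match the output above, and `grid` and `matrix` are just illustrative names. It assumes each (X, Y) pair occurs at most once, since `pivot` raises a ValueError on duplicate pairs.

```python
import numpy as np
import pandas as pd

# Illustrative data: some (X, Y) coordinates are deliberately absent.
df = pd.DataFrame({
    "X": [1, 2, 3, 1, 2, 1, 4],
    "Y": [1, 1, 1, 2, 2, 3, 3],
    "value": [56, 13, 25, 7, 18, 91, 32],
})

# Rows are taken from Y and columns from X; absent coordinates become NaN,
# which is why the result is a float array.
grid = df.pivot(index="Y", columns="X", values="value")
matrix = grid.to_numpy()

print(matrix.shape)
```

Note that a row or column only exists for Y or X values that occur somewhere in the data. If an entire coordinate value can be absent, calling `grid.reindex(index=..., columns=...)` with the full ranges before converting should restore the gaps as NaN.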

answered Aug 11, 2017 at 17:40

akuiper


Zero · Accepted Answer · 2017-08-11 17:44:25Z

4

Using set_index

In [501]: df.set_index(['Y', 'X']).unstack().values
Out[501]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, using groupby

In [493]: df.groupby(['Y', 'X'])['value'].sum().unstack().values
Out[493]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, using crosstab

In [500]: pd.crosstab(index=df.Y, columns=df.X, values=df.value, aggfunc='sum').values
Out[500]:
array([[ 56.,  13.,  25.,  nan],
       [  7.,  18.,  nan,  nan],
       [ 91.,  nan,  nan,  32.]])

Or, using pd.pivot_table, as pointed out in another answer.
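
The pivot_table variant is not spelled out above, so here is a rough sketch of what it might look like. The data is illustrative, with one (X, Y) pair deliberately repeated to show that, like the groupby and crosstab versions, an aggregating approach combines duplicates instead of raising an error. Choosing 'sum' as the aggregation is an assumption carried over from the examples above.

```python
import numpy as np
import pandas as pd

# Illustrative data; the coordinate (X=4, Y=3) appears twice on purpose.
df = pd.DataFrame({
    "X": [1, 2, 3, 1, 2, 1, 4, 4],
    "Y": [1, 1, 1, 2, 2, 3, 3, 3],
    "value": [56, 13, 25, 7, 18, 91, 30, 2],
})

# Duplicate coordinates are summed; coordinates with no data stay NaN.
table = pd.pivot_table(df, index="Y", columns="X", values="value", aggfunc="sum")
matrix = table.to_numpy()

print(matrix.shape)
```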

answered Aug 11, 2017 at 17:44

Zero


jeremycg · Accepted Answer · 2017-08-11 18:08:33Z

4

I would do this by going through a sparse coordinate (COO) matrix, which is basically the format you already have.

NB: missing spots will be stored as 0s if you convert to an array.

If a lot of values are missing, it might be better to stick with a sparse matrix for memory or performance reasons, depending on your downstream processing.

import pandas as pd
from scipy.sparse import coo_matrix

x = pd.DataFrame({'X': [1, 2, 3, 1, 2, 1, 4], 'Y': [1, 1, 1, 2, 2, 3, 3], 'Z': [56, 13, 25, 7, 18, 91, 32]})

# coo_matrix takes (data, (row_indices, col_indices))
out = coo_matrix((x.Z, (x.Y - 1, x.X - 1)))  # subtract 1 because X and Y are 1-based above
# if you really don't want a sparse matrix, convert it to a dense array:
out.toarray()
array([[56, 13, 25,  0],
       [ 7, 18,  0,  0],
       [91,  0,  0, 32]], dtype=int64)
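
If a genuine 0 has to stay distinguishable from a missing coordinate, one alternative (a different technique from the sparse route above, shown only as a sketch) is to pre-fill a NumPy array with NaN and assign the values by index. The names `rows`, `cols`, and `dense` are illustrative, and the sketch assumes 1-based integer coordinates with no duplicate (X, Y) pairs.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({
    "X": [1, 2, 3, 1, 2, 1, 4],
    "Y": [1, 1, 1, 2, 2, 3, 3],
    "Z": [56, 13, 25, 7, 18, 91, 32],
})

# Convert the 1-based coordinates to 0-based array indices.
rows = x["Y"].to_numpy() - 1
cols = x["X"].to_numpy() - 1

# Start from an all-NaN float array, then write each value at (row, col).
dense = np.full((rows.max() + 1, cols.max() + 1), np.nan)
dense[rows, cols] = x["Z"].to_numpy()

print(dense.shape)
```

Unlike the pivot-based answers, this keeps a row or column for every index up to the maximum, even if that coordinate value never occurs in the data.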

edited Aug 11, 2017 at 18:08

answered Aug 11, 2017 at 17:54

jeremycg
