
I have a Pandas DataFrame, read from a CSV, that contains X and Y coordinates and a value that I need to put into a matrix and save to a text file. So I created a NumPy array with max(X) by max(Y) extent.

I have this file:

fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899

And I need this one:

   45.01    44.89 -9999.00    
-9999.00    45.85 -9999.00
   22.63 -9999.00    23.19

To do that, I'm using a loop like this:

for row in data.iterrows():
    # row[1] is the row as a Series: positions 0..3 are fid, x, y, value
    p[int(row[1][2]), int(row[1][1])] = row[1][3]

and then I save it to disk using np.array2string. It works.

As the original CSV has 68 million lines, it's taking a long time to process, so I wonder if there's a more Pythonic and faster way to do it.
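For reference, here is a self-contained sketch of the current approach on the five sample rows. It is an assumption-laden illustration: the loop below shifts the 1-based coordinates to 0-based and indexes p[x-1, y-1] so the result matches the 3×3 grid above (the question's loop indexes p[y, x] instead), and np.savetxt stands in for np.array2string for the fixed-width output:

```python
import io
import numpy as np
import pandas as pd

# The five sample rows from the question
csv = """fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
"""
data = pd.read_csv(io.StringIO(csv))

# Matrix sized from the maximum coordinates, pre-filled with the nodata value
p = np.full((data['x'].max(), data['y'].max()), -9999.0)

# Row-by-row loop, as in the question -- slow at 68M rows because every
# iteration builds a pandas Series
for _, r in data.iterrows():
    p[int(r['x']) - 1, int(r['y']) - 1] = r['agblongo_tch_alive']

# Fixed-width text output
buf = io.StringIO()
np.savetxt(buf, p, fmt='%8.2f')
print(buf.getvalue())
```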

  • Could you provide a minimal reproducible example? It's not really clear what you're trying to do here Commented Jun 11, 2018 at 21:14
  • I think you want stackoverflow.com/questions/45640852/… and then just write the output to a text file using array2string Commented Jun 11, 2018 at 21:16
  • What are you actually trying to solve with the matrix? The sensible way might be to keep the matrix in memory rather than write to disk. Commented Jun 11, 2018 at 21:21
  • if the 68M rows are a flattened representation of your matrix, then it's roughly 8250×8250, which is already pretty huge. Commented Jun 11, 2018 at 21:26
  • I edited the question. I need to write it to disk because I need the file in a specific format, not the matrix itself. I'll check the solution, user3483203. Commented Jun 11, 2018 at 21:31

1 Answer


Assuming the columns of your df are 'x', 'y', 'value', you can use advanced indexing:

>>> x, y, value = data['x'].values, data['y'].values, data['value'].values
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> result[y, x] = value

This will, however, not work properly if the coordinates are not unique (the last write wins, in unspecified order). In that case it is safer (but slower) to use np.add.at, which sums the values at duplicate coordinates:

>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> np.add.at(result, (y, x), value)
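Applied to the five sample rows from the question (with the value column renamed to 'value' to match the snippets above), both approaches produce the same 4×4 array; row and column 0 stay unused because the coordinates are 1-based:

```python
import io
import numpy as np
import pandas as pd

# Sample data from the question, value column renamed to 'value'
csv = """fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
"""
data = pd.read_csv(io.StringIO(csv)).rename(columns={'agblongo_tch_alive': 'value'})

x, y, value = data['x'].values, data['y'].values, data['value'].values

# One vectorized write instead of a Python-level loop
result = np.zeros((y.max() + 1, x.max() + 1), value.dtype)
result[y, x] = value

# np.add.at gives the same result here because the coordinates are unique
result2 = np.zeros((y.max() + 1, x.max() + 1), value.dtype)
np.add.at(result2, (y, x), value)
```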

Alternatively, you can create a sparse matrix, since your data happen to be in sparse COO format. Using the .A property (an alias for .toarray()) you can then convert it to a normal (dense) array as needed:

>>> from scipy import sparse
>>> spM = sparse.coo_matrix((value, (y, x)), (y.max()+1, x.max()+1))
>>> (spM.A == result).all()
True
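Worth noting: like np.add.at, the sparse route sums values at duplicate coordinates when converting to dense. A minimal sketch with made-up coordinates (not from the question's data) showing that behaviour:

```python
import numpy as np
from scipy import sparse

# Duplicate coordinate: (y, x) = (1, 1) appears twice, so the dense cell
# holds 10 + 5, matching what np.add.at would produce.
y = np.array([1, 1, 2])
x = np.array([1, 1, 2])
value = np.array([10.0, 5.0, 7.0])

spM = sparse.coo_matrix((value, (y, x)), (y.max() + 1, x.max() + 1))
dense = spM.toarray()  # .A is an alias for .toarray()
```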

Update: if the fill value is not zero, the above must be modified.

Method 1: replace the second line with (remember this should only be used if the coordinates are unique):

>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)

Method 2: does not work, because np.add.at would add the values on top of the fill value instead of replacing it.

Method 3: after creating spM, do:

>>> spM.sum_duplicates()
>>> assert spM.has_canonical_format
>>> spM.data -= fillvalue
>>> result2 = spM.A + fillvalue
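Putting Method 3 together on the sample data, a sketch under a couple of assumptions: the 1-based coordinates are shifted to 0-based, x is used as the row and y as the column so the array matches the 3×3 grid in the question, and .toarray() stands in for .A:

```python
import numpy as np
from scipy import sparse

fillvalue = -9999.0

# Coordinates/values from the sample CSV, shifted to 0-based
x = np.array([1, 1, 2, 3, 3]) - 1
y = np.array([1, 2, 2, 1, 3]) - 1
value = np.array([45.0126083457747, 44.8996854102889, 45.8565022933761,
                  22.6352522929758, 23.1935887499899])

spM = sparse.coo_matrix((value, (x, y)), (x.max() + 1, y.max() + 1))
spM.sum_duplicates()                 # ensures canonical format
assert spM.has_canonical_format
spM.data -= fillvalue                # stored entries become value - fillvalue
result = spM.toarray() + fillvalue   # empty cells come out exactly at fillvalue
```

Something like np.savetxt('out.txt', result, fmt='%8.2f') should then give a fixed-width text layout close to the one shown in the question.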

3 Comments

Paul, thanks! It's almost there; however, I can't sum the duplicate values because that doubles the corresponding value. Instead, I need to drop one of the duplicate lines.
@MauroAssis if all rows with the same coordinates also have the same value and you don't want to count their multiplicity you can actually use Method 1. Because in that case it doesn't matter that the order of assignment is undefined.
Paul, in fact, I checked and there's only one duplicated value, so I think I will keep method three. Thank you very much!
