
I have a PySpark dataframe df with columns 'latitude' and 'longitude':

+---------+---------+
| latitude|longitude|
+---------+---------+
|51.822872| 4.905615|
|51.819645| 4.961687|
| 51.81964| 4.961713|
| 51.82256| 4.911187|
|51.819263| 4.904488|
+---------+---------+

I want to get the UTM coordinates ('x' and 'y') from the dataframe columns. To do this, I need to feed the 'longitude' and 'latitude' values to the following function from pyproj. The resulting 'x' and 'y' should then be appended to the original dataframe df. This is how I did it in Pandas:

from pyproj import Proj
pp = Proj(proj='utm',zone=31,ellps='WGS84', preserve_units=False)
xx, yy = pp(df["longitude"].values, df["latitude"].values)
df["X"] = xx
df["Y"] = yy

How would I do this in PySpark?

1 Answer

Use pandas_udf: feed the function an array column and have it return an array as well. See below:

from pyspark.sql.functions import array, pandas_udf, PandasUDFType
from pyproj import Proj
from pandas import Series

@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm(coords):
    # each element of `coords` is a [longitude, latitude] pair
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return Series([list(pp(e[0], e[1])) for e in coords])

df.withColumn('utm', get_utm(array('longitude','latitude'))) \
  .selectExpr("*", "utm[0] as X", "utm[1] as Y") \
  .show()

8 Comments

I got a variety of errors from df.withColumn('utm', get_utm(array('longitude','latitude'))).selectExpr("*", "utm[0] as X", "utm[1] as Y"), such as "Python worker failed to connect back" and "java.net.SocketTimeoutException: Accept timed out". Is there another way, or are these errors worth solving?
@Jeroen, there is a known issue with pandas_udf and the pyarrow version, see stackoverflow.com/questions/61202005/…. If it's not the same issue, can you post the error?
I checked my versions: PySpark is v2.4.5 and pyarrow was 0.13. According to the link I had to downgrade to version 0.10, which I did; conda list now shows version 0.10 for pyarrow. But now I get the error: ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
Can you give me your versions of PyArrow and PySpark?
I accepted the answer, but still haven't solved the PyArrow problem. I will look into it more this weekend.
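The version problems discussed above all stem from pandas_udf's pyarrow dependency; a plain Python udf does not use Arrow at all, trading per-row overhead for compatibility. As a sketch of what such a fallback could compute, here is a standard-library-only longitude/latitude to UTM conversion using the usual Transverse Mercator series expansion on the WGS84 ellipsoid. This is an approximation written for illustration, not pyproj itself, and the function name is my own:

```python
import math

def lonlat_to_utm(lon, lat, zone=31):
    """Approximate UTM easting/northing (metres) for WGS84, standard TM series."""
    a = 6378137.0                 # WGS84 semi-major axis (m)
    e2 = 0.00669437999014         # first eccentricity squared
    k0 = 0.9996                   # UTM scale factor
    ep2 = e2 / (1.0 - e2)         # second eccentricity squared

    phi = math.radians(lat)
    lam0 = math.radians(zone * 6 - 183)   # central meridian of the zone
    dlam = math.radians(lon) - lam0

    N = a / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
    T = math.tan(phi) ** 2
    C = ep2 * math.cos(phi) ** 2
    A = dlam * math.cos(phi)

    # meridional arc length from the equator
    M = a * ((1 - e2 / 4 - 3 * e2**2 / 64 - 5 * e2**3 / 256) * phi
             - (3 * e2 / 8 + 3 * e2**2 / 32 + 45 * e2**3 / 1024) * math.sin(2 * phi)
             + (15 * e2**2 / 256 + 45 * e2**3 / 1024) * math.sin(4 * phi)
             - (35 * e2**3 / 3072) * math.sin(6 * phi))

    x = k0 * N * (A + (1 - T + C) * A**3 / 6
                  + (5 - 18 * T + T**2 + 72 * C - 58 * ep2) * A**5 / 120) + 500000.0
    y = k0 * (M + N * math.tan(phi) * (A**2 / 2
              + (5 - T + 9 * C + 4 * C**2) * A**4 / 24
              + (61 - 58 * T + T**2 + 600 * C - 330 * ep2) * A**6 / 720))
    return x, y

# first row of the question's dataframe: pp(longitude, latitude)
x, y = lonlat_to_utm(4.905615, 51.822872)
```

Wrapped as udf(lambda lon, lat: list(lonlat_to_utm(lon, lat)), 'array<double>') from pyspark.sql.functions (or with pyproj's Proj inside the lambda instead), this runs without Arrow and so sidesteps the pyarrow version issue entirely.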
