
My PySpark DataFrame looks like this:

+--------+----------+----+----+----+
|latitude| longitude|var1|date|var2|
+--------+----------+----+----+----+
|    3.45|     -8.65|   1|   7|   2|
|   30.45|     45.65|   1|   7|   2|
|   40.45|    123.65|   1|   7|   2|
|   43.45|     13.65|   1|   7|   2|
|   44.45|    -12.65|   1|   7|   2|
|   54.45|   -128.65|   1|   7|   2|
+--------+----------+----+----+----+

but I don't know how to reshape it so that there is a single record per date, with a multi-level column specifying [variable, latitude, longitude] in that order, so I can treat each combination of variable, latitude and longitude as a separate column.

Making this:

from pyspark.sql import functions as F

var_cols = ['var1', 'var2']

df.select(
    'date',
    *[F.array(F.col(col), F.col('latitude'), F.col('longitude')) for col in var_cols]
).show()

I get:

+----+---------------------------------+---------------------------------+
|date|array(var1, latitude, longitude) |array(var2, latitude, longitude) |
+----+---------------------------------+---------------------------------+
|   7|               [1.0, 3.45, -8.65]|               [2.0, 3.45, -8.65]|
|   7|              [1.0, 30.45, 45.65]|              [2.0, 30.45, 45.65]|
|   7|             [1.0, 40.45, 123.65]|             [2.0, 40.45, 123.65]|
|   7|              [1.0, 43.45, 13.65]|              [2.0, 43.45, 13.65]|
|   7|             [1.0, 44.45, -12.65]|             [2.0, 44.45, -12.65]|
|   7|             [1.0, 54.45, -128...|             [2.0, 54.45, -128...|
+----+---------------------------------+---------------------------------+

And I would like a column with a single value (the value of the variable) for each value of latitude and longitude. Imagine setting an index of [date, latitude, longitude] in pandas and then unstacking the latitude and longitude levels.

For example, in pandas I would do this:

df.set_index(["date", "latitude", "longitude"]).unstack().unstack()
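
For concreteness, here is a runnable version of that pandas reshape on the sample data above (pdf here is just the sample frame rebuilt locally):

import pandas as pd

# Sample data mirroring the PySpark DataFrame above
pdf = pd.DataFrame({
    "latitude":  [3.45, 30.45, 40.45],
    "longitude": [-8.65, 45.65, 123.65],
    "var1": [1, 1, 1],
    "date": [7, 7, 7],
    "var2": [2, 2, 2],
})

# One row per date; each (variable, latitude, longitude) combination
# ends up as its own column after unstacking twice
wide = pdf.set_index(["date", "latitude", "longitude"]).unstack().unstack()
print(wide)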
  • How do you want to treat the two variables? [var1, lat, long], [var2, lat, long], or [var1, var2, lat, long]? Commented Nov 30, 2020 at 8:11
  • The [var1, lat, long], [var2, lat, long] way @mck Commented Nov 30, 2020 at 8:12
  • So you want 3 columns: date, [v1,l,l], [v2,l,l]? Commented Nov 30, 2020 at 8:13
  • That's it @mck. Commented Nov 30, 2020 at 8:14

2 Answers


How about this:

from pyspark.sql import functions as F

# Everything except the key columns is a variable to spread out
var_cols = [col for col in df.columns if col not in ['date', 'latitude', 'longitude']]

# Pivot on a string key built from latitude and longitude
df.withColumn('latlong',
              F.concat_ws('_', F.col('latitude'), F.col('longitude'))) \
  .groupBy('date') \
  .pivot('latlong') \
  .agg(*[F.first(col) for col in var_cols])
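
On the sample data this should give one row per date, with pivoted columns named after each latitude_longitude pair. As a sketch, aliasing each aggregation should yield readable names, since Spark appends the aggregation alias to the pivot value when there is more than one aggregation:

from pyspark.sql import functions as F

var_cols = [c for c in df.columns if c not in ['date', 'latitude', 'longitude']]

result = (
    df.withColumn('latlong', F.concat_ws('_', F.col('latitude'), F.col('longitude')))
      .groupBy('date')
      .pivot('latlong')
      .agg(*[F.first(c).alias(c) for c in var_cols])  # e.g. 3.45_-8.65_var1
)
result.show()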

4 Comments

Pivoting on "latlong" returns me an error: Py4JJavaError: An error occurred while calling o955.pivot. : java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray(3.45, -8.65) at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419) at org.apache.spark.sql.RelationalGroupedDataset$$anonfun$pivot$1.apply(RelationalGroupedDataset.scala:419)
What version of Spark are you using? This works for me on Spark 3.0.0
Seems to be issues.apache.org/jira/browse/SPARK-26403. I guess you're using Spark 2
That's it. I'm using Spark 2

I came across this solution:

from pyspark.sql import functions as F

var_cols = [col for col in df.columns if col not in ['date', 'latitude', 'longitude']]

# Combine latitude and longitude into an array, then flatten it to a
# comma-separated string so it can be used as the pivot key
df = df.withColumn('latlong', F.array(F.col('latitude'), F.col('longitude')))
df = df.withColumn('latlong', F.concat_ws(',', 'latlong'))

# max picks the single value for each unique (date, latlong) pair
df = df.groupBy(["date"]).pivot("latlong").max(*var_cols)

2 Comments

@mck Can you spot any problem with this solution? The combination of date, latitude and longitude should be unique, so the max aggregation function should work fine. Is there any improvement in terms of efficiency?
Looks good to me, but the two lines can be combined into one. I'll update my answer
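
For reference, a sketch of that combined version, building the string key in a single withColumn (same logic as the answer above, one column operation fewer):

from pyspark.sql import functions as F

var_cols = [col for col in df.columns if col not in ['date', 'latitude', 'longitude']]

# Build the comma-separated key directly, skipping the intermediate array;
# max is safe because each (date, latitude, longitude) combination is unique
df = (
    df.withColumn('latlong', F.concat_ws(',', F.col('latitude'), F.col('longitude')))
      .groupBy('date')
      .pivot('latlong')
      .max(*var_cols)
)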
