I frequently have a (time-ordered) Spark DataFrame and want to know the differences between the values in consecutive rows:

>>> df.show()
+-----+----------+----------+
|index|        c1|        c2|
+-----+----------+----------+
|  0.0|0.35735932|0.39612636|
|  1.0| 0.7279809|0.54678476|
|  2.0|0.68788993|0.25862947|
|  3.0|  0.645063| 0.7470685|
+-----+----------+----------+

The question of how to do this has been asked before in narrower contexts:

pyspark, Compare two rows in dataframe

Date difference between consecutive rows - Pyspark Dataframe

However, I find the answers rather involved:

  • a separate module "Window" must be imported
  • for some data types (datetimes) a cast must be done
  • then, using "lag", the rows can finally be compared

It strikes me as odd that this cannot be done with a single API call, for example like so:

>>> import pyspark.sql.functions as f
>>> df.select(f.diffs(df.c1)).show()
+----------+
| diffs(c1)|
+----------+
|   0.3706 |
|  -0.0400 |
|  -0.0428 |
|     null |
+----------+

What is the reason for this?

1 Answer

There are a few basic reasons:

  • In general, the distributed data structures used in Spark are not ordered. In particular, any lineage containing a shuffle phase (exchange) can output a structure with non-deterministic order.

    As a result, when we talk about a Spark DataFrame we actually mean a relation, not a DataFrame as known from local libraries like Pandas. Without an explicit ordering, comparing consecutive rows is just not meaningful.

  • Things get even fuzzier when you realize that the sorting methods used in Spark use shuffles and are not stable.

  • Even if you ignore possible non-determinism, handling partition boundaries is a bit involved and typically breaks lazy execution.

    In other words, you cannot access the element to the left of the first element of a given partition, or to the right of the last element of a given partition, without a shuffle, an additional action, or a separate data scan.
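
The partition-boundary point can be illustrated with a small plain-Python sketch. This is not Spark code: nested lists stand in for partitions, and the function names `diffs_within_partitions` and `diffs_across_partitions` are made up for the illustration. It assumes the data is already globally ordered, which, per the first point, Spark does not guarantee on its own.

```python
def diffs_within_partitions(partitions):
    """Next-minus-current differences, computed independently per partition."""
    out = []
    for part in partitions:
        local = [b - a for a, b in zip(part, part[1:])]
        if part:
            # The last element has no right-hand neighbour inside its partition.
            local.append(None)
        out.append(local)
    return out


def diffs_across_partitions(partitions):
    """Same differences, but correct at the partition boundaries."""
    # Extra pass over the data: fetch the first element of every partition.
    # In a distributed setting this is the "additional action / separate scan".
    heads = [part[0] if part else None for part in partitions]
    out = []
    for i, part in enumerate(partitions):
        # First element of the next non-empty partition, if there is one.
        nxt = next((h for h in heads[i + 1:] if h is not None), None)
        extended = part + ([nxt] if nxt is not None else [])
        local = [b - a for a, b in zip(extended, extended[1:])]
        if nxt is None and part:
            # Only the globally last element is left without a neighbour.
            local.append(None)
        out.append(local)
    return out


partitions = [[1, 4], [9, 16]]  # one ordered sequence split into two partitions

print(diffs_within_partitions(partitions))  # [[3, None], [7, None]]
print(diffs_across_partitions(partitions))  # [[3, 5], [7, None]]
```

The per-partition version loses the difference between 4 and 9 because the two values live in different partitions. Recovering it requires knowing something about the neighbouring partition before the local computation can finish, which is why a hypothetical one-call `diffs` could not be a simple lazy, partition-local transformation.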
