0

I am new to Apache Spark, I have a use case to find the date gap identification between multiple dates.

e.g

In the above example, the member had a gap between 2018-02-01 to 2018-02-14. How to find this Apache Spark 2.3.4 using Scala.

Excepted output for the above scenario is,

enter image description here

0

1 Answer 1

1

You could use datediff along with Window function lag to check for day-gaps between current and previous rows, and compute the missing date ranges with some date functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
import java.sql.Date

val df = Seq(
  (1, Date.valueOf("2018-01-01"), Date.valueOf("2018-01-31")),
  (1, Date.valueOf("2018-02-16"), Date.valueOf("2018-02-28")),
  (1, Date.valueOf("2018-03-01"), Date.valueOf("2018-03-31")),
  (2, Date.valueOf("2018-07-01"), Date.valueOf("2018-07-31")),
  (2, Date.valueOf("2018-08-16"), Date.valueOf("2018-08-31"))
).toDF("MemberId", "StartDate", "EndDate")

val win = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")

df.
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(win), date_sub($"StartDate", 1))).
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  where($"DayGap" > 1).
  select($"MemberId", date_add($"PrevEndDate", 1).as("StartDateGap"), date_sub($"StartDate", 1).as("EndDateGap")).
  show
// +--------+------------+----------+
// |MemberId|StartDateGap|EndDateGap|
// +--------+------------+----------+
// |       1|  2018-02-01|2018-02-15|
// |       2|  2018-08-01|2018-08-15|
// +--------+------------+----------+
Sign up to request clarification or add additional context in comments.

5 Comments

This one is good, but it fails in certain cases where dates overlap, e.g. if you add (2, Date.valueOf("2018-05-01"), Date.valueOf("2018-08-05") to the DF you will still get the same result as above. Still a good answer, assuming steps are taken to remove redundant data.
Thanks, but it fails when there are overlapping dates e.g if you add (1,Date.valueOf("2018-01-15"),Date.valueOf("2018-01-30")) to the DF you will get StartDateGap as "2018-01-31" however this member has an enrollment on "2018-01-31". Could you kindly tell me how to remove the overlapping days before computing this.?
Leo, I found one of your answers to remove overlapping dates (stackoverflow.com/questions/52877237/…). Thanks a lot. I will try my best to merge these two. willl let you know if i need any more help. Thanks a lot.
@Gowri24, yes the above solution is based on your simplistic sample data without overlapping of date ranges. Combining my other solution with this one will work. After applying the other solution, simply apply to the resulting DataFrame resDF.withColumn("RowNum", row_number.over(win3)).where($"RowNum" === 1), with win3 defined as Window.partitionBy("MemberId", "StartDate", "EndDate").orderBy("StartDate", "EndDate"), followed by applying the above transformations for gaps finding.
@LeoC, Thanks for your response.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.