
Using a table of events, I need to return the date and type for:

  • the first event
  • the most recent (non-null) event

The most recent event could have null values; in that case the query needs to return the most recent non-null values instead.

I found a few articles, as well as posts here on SO, that are similar (maybe even identical), but I am not able to decode or understand the solutions, e.g.:

Fill null values with last non-null amount - Oracle SQL

https://www.itprotoday.com/sql-server/last-non-null-puzzle

https://koukia.ca/common-sql-problems-filling-null-values-with-preceding-non-null-values-ad538c9e62a6

The table is as follows. There are additional columns, but I am only including three for the sake of simplicity. Also note that the first Type and Date could be null; in that case, returning null is desired.

╔═══════╦════════╦════════════╗
║ Email ║  Type  ║    Date    ║
╠═══════╬════════╬════════════╣
║ A     ║ Create ║ 2019-04-01 ║
║ A     ║ Update ║ 2019-04-02 ║
║ A     ║ null   ║ null       ║
╚═══════╩════════╩════════════╝

The output should be:

╔═══════╦═══════════╦════════════╦══════════╦════════════╗
║ Email ║ FirstType ║ FirstDate  ║ LastType ║  LastDate  ║
╠═══════╬═══════════╬════════════╬══════════╬════════════╣
║ A     ║ Create    ║ 2019-04-01 ║ Update   ║ 2019-04-02 ║
╚═══════╩═══════════╩════════════╩══════════╩════════════╝

The first method I tried was to join the table to a subquery that finds the MIN and MAX dates, and then pick out the matching rows with case statements:

select
  T1.Email,
  max(case when T1.Date = T2.Min_Date then T1.Type end) as FirstType,
  max(case when T1.Date = T2.Min_Date then T1.Date end) as FirstDate,
  max(case when T1.Date = T2.Max_Date then T1.Type end) as LastType,
  max(case when T1.Date = T2.Max_Date then T1.Date end) as LastDate
from
  T1
join
  (select
    Email,
    max(Date) as Max_Date,
    min(Date) as Min_Date
  from
    T1
  group by 
    Email
  ) T2
on
  T1.Email = T2.Email
group by
  T1.Email

This seemed to work for the MIN values, but the MAX values would return null.

To solve the problem of returning the last non-null value, I attempted this:

select
   Email,
   max(Date) over (partition by Email rows unbounded preceding) as LastDate,
   max(Type) over (partition by Email rows unbounded preceding) as LastType
from
   T1
group by
   Email,
   Date,
   Type

However, this gives a result of 3 rows, instead of 1.
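
The three rows are what a window function is expected to produce: it computes a value for every row it is given and does not collapse the partition the way a plain GROUP BY aggregate does. Below is a minimal sketch of that difference. It uses Python's built-in sqlite3 module only so that it can be run anywhere, which is an assumption for illustration; the SQL is meant to carry over to Spark SQL but has not been run on Databricks. The table name t1 and the lowercase column names are likewise chosen for the sketch.

```python
import sqlite3

# In-memory table mirroring the three sample rows (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (email TEXT, type TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        ("A", "Create", "2019-04-01"),
        ("A", "Update", "2019-04-02"),
        ("A", None, None),
    ],
)

# Window aggregate: the max is attached to every input row.
windowed = conn.execute(
    "SELECT email, MAX(date) OVER (PARTITION BY email) AS last_date FROM t1"
).fetchall()

# Plain aggregate: GROUP BY collapses each email to a single row.
aggregated = conn.execute(
    "SELECT email, MAX(date) AS last_date FROM t1 GROUP BY email"
).fetchall()

print(len(windowed), "rows from the window query")
print(len(aggregated), "row from the grouped query")
```

Note also that MAX(Type) returns the alphabetically greatest type, not the type of the latest event, so it cannot be relied on to pair the right Type with the latest Date.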

I'll admit I don't quite understand analytic functions since I have not had to deal with them at length. Any help would be greatly appreciated.

Edit: The example above is an accurate representation of what the data could look like; however, the example below is the exact sample data that I am using.

Sample:

╔═══════╦════════╦════════════╗
║ Email ║  Type  ║    Date    ║
╠═══════╬════════╬════════════╣
║ A     ║ Create ║ 2019-04-01 ║
║ A     ║ null   ║ null       ║
╚═══════╩════════╩════════════╝

Desired Outcome:

╔═══════╦═══════════╦════════════╦══════════╦════════════╗
║ Email ║ FirstType ║ FirstDate  ║ LastType ║  LastDate  ║
╠═══════╬═══════════╬════════════╬══════════╬════════════╣
║ A     ║ Create    ║ 2019-04-01 ║ Create   ║ 2019-04-01 ║
╚═══════╩═══════════╩════════════╩══════════╩════════════╝

Additional Use-Case:

╔═══════╦════════╦════════════╗
║ Email ║  Type  ║    Date    ║
╠═══════╬════════╬════════════╣
║ A     ║ null   ║ null       ║
║ A     ║ Create ║ 2019-04-01 ║
╚═══════╩════════╩════════════╝

Desired Outcome:

╔═══════╦═══════════╦════════════╦══════════╦════════════╗
║ Email ║ FirstType ║ FirstDate  ║ LastType ║  LastDate  ║
╠═══════╬═══════════╬════════════╬══════════╬════════════╣
║ A     ║ null      ║ null       ║ Create   ║ 2019-04-01 ║
╚═══════╩═══════════╩════════════╩══════════╩════════════╝

2 Answers

0

Use window functions and conditional aggregation:

select t.email,
       max(case when seqnum = 1 then type end) as first_type,
       max(case when seqnum = 1 then date end) as first_date,
       max(case when seqnum_nonull = 1 and type is not null then type end) as last_type,
       max(case when seqnum_nonull = 1 and type is not null then date end) as last_date
from (select t.*,
             -- numbers the rows of each email by ascending date
             row_number() over (partition by email order by date) as seqnum,
             -- numbers null and non-null types separately by descending date,
             -- so 1 marks the latest non-null row
             row_number() over (partition by email, (case when type is null then 1 else 2 end) order by date desc) as seqnum_nonull
      from t
     ) t
group by t.email;
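
Because the rows with a null Date cannot be placed in time by Date alone, "first" only becomes well defined if something else records the order of the events. The sketch below assumes a hypothetical seq column for that purpose (it is not in the question) and applies the same two-row-number idea to both of the asker's edited use cases. It runs on Python's built-in sqlite3 purely for convenience; the table name events is illustrative, and the query has not been run on Databricks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is a HYPOTHETICAL ordering column; the question does not show one.
conn.execute("CREATE TABLE events (seq INTEGER, email TEXT, type TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "A", "Create", "2019-04-01"),  # use case: null row arrives last
        (2, "A", None, None),
        (1, "B", None, None),              # use case: null row arrives first
        (2, "B", "Create", "2019-04-01"),
    ],
)

QUERY = """
SELECT email,
       MAX(CASE WHEN rn_first = 1 THEN type END) AS first_type,
       MAX(CASE WHEN rn_first = 1 THEN date END) AS first_date,
       MAX(CASE WHEN type IS NOT NULL AND rn_last = 1 THEN type END) AS last_type,
       MAX(CASE WHEN type IS NOT NULL AND rn_last = 1 THEN date END) AS last_date
FROM (SELECT e.*,
             -- 1 marks the first event per email, null or not
             ROW_NUMBER() OVER (PARTITION BY email ORDER BY seq) AS rn_first,
             -- null and non-null types are numbered separately, newest first
             ROW_NUMBER() OVER (PARTITION BY email, type IS NULL
                                ORDER BY seq DESC) AS rn_last
      FROM events e) AS numbered
GROUP BY email
ORDER BY email
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Email A comes back with the Create row as both first and last event, and email B comes back with nulls for the first event and the Create row for the last, which is what the two desired outcomes in the question ask for.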

3 Comments

Thanks for the reply - this code returns an error in Databricks that I can't debug. Error in SQL statement: ParseException: mismatched input 'from' expecting <EOF>(line 6, pos 0)
@A.Coustic . . . There was a typo in the definition of seqnum_nonull that might've caused your problem.
The typo fix did solve the problem, however the output isn't exactly as desired. The sample dataset I am using only has 2 records - the first is non-null and the second has null values. The outcome returns the null value as the first event, and the non-null as the second event (I think this is flipped?) The desired outcome should return the first record for both the first and last events. Also I believe there is a typo in the last case statement where it is mixing type and date.
0

Because Spark SQL window functions support the NULLS LAST | NULLS FIRST syntax, you could use that and then specify a pivot with multiple aggregates for rn values 1 and 2. I could do with seeing some more sample data, but this works for your dataset:

%sql
SELECT *, ROW_NUMBER() OVER( PARTITION BY email ORDER BY date NULLS LAST ) rn
FROM tmp;

;WITH cte AS
(
SELECT *, ROW_NUMBER() OVER( PARTITION BY email ORDER BY date NULLS LAST ) rn
FROM tmp
)
SELECT *
FROM cte
PIVOT ( MAX(date), MAX(type) FOR rn In ( 1, 2 ) )

Rename the columns by supplying your required parts in the query, e.g.:

-- Pivot and rename columns
;WITH cte AS
(
SELECT *, ROW_NUMBER() OVER( PARTITION BY email ORDER BY date NULLS LAST ) rn
FROM tmp
)
SELECT *
FROM cte
PIVOT ( MAX(date) AS Date, MAX(type) AS Type FOR rn In ( 1 First, 2 Last ) ) 

Alternatively, supply a column list, e.g.:

-- Pivot and rename columns
;WITH cte AS
(
SELECT *, ROW_NUMBER() OVER( PARTITION BY email ORDER BY date NULLS LAST ) rn
FROM tmp
), cte2 AS
(
SELECT *
FROM cte
PIVOT ( MAX(date) AS Date, MAX(type) AS Type FOR rn In ( 1 First, 2 Last ) )
) 
SELECT *
FROM cte2 AS t (Email, FirstDate, FirstType, LastDate, LastType)

This simple query uses ROW_NUMBER to assign a row number to the dataset ordered by the date column, but using the NULLS LAST syntax to ensure null rows appear last in the numbering. The PIVOT then converts the rows to columns.
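
One limitation of pivoting on rn values 1 and 2 is that rn = 2 is only the last dated event when there are exactly two dated rows. A possible variation, sketched below, numbers the rows in both directions with nulls kept last and reads the last event from the descending number instead. The sketch runs on Python's built-in sqlite3 for convenience, so two things are swapped in and are assumptions rather than the Spark syntax above: ordering by "date IS NULL" stands in for NULLS LAST, and MAX(CASE ...) stands in for PIVOT. The names rn_asc and rn_desc are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (email TEXT, type TEXT, date TEXT)")
# The two-row sample from the edited question: one dated row, one null row.
conn.executemany(
    "INSERT INTO tmp VALUES (?, ?, ?)",
    [("A", "Create", "2019-04-01"), ("A", None, None)],
)

QUERY = """
WITH cte AS (
    SELECT *,
           -- "date IS NULL" is 0 for dated rows and 1 for null rows,
           -- so null rows sort last in both numberings
           ROW_NUMBER() OVER (PARTITION BY email
                              ORDER BY date IS NULL, date) AS rn_asc,
           ROW_NUMBER() OVER (PARTITION BY email
                              ORDER BY date IS NULL, date DESC) AS rn_desc
    FROM tmp
)
SELECT email,
       MAX(CASE WHEN rn_asc = 1 THEN type END)  AS first_type,
       MAX(CASE WHEN rn_asc = 1 THEN date END)  AS first_date,
       MAX(CASE WHEN rn_desc = 1 THEN type END) AS last_type,
       MAX(CASE WHEN rn_desc = 1 THEN date END) AS last_date
FROM cte
GROUP BY email
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

Since this orders by date only, a null row is never reported as the first event, so it does not cover the use case where the null record comes first; that would need a separate column recording the order of the events.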

5 Comments

Thanks, this seems to work. How can I rename the output columns? Can you also briefly explain what is going on here?
Any update @A.Coustic? Consider marking this as the answer if it works and answers your question.
It looks like the output isn't exactly as desired. The sample dataset I am using only has 2 records, the latter being the null record (in retrospect, I should have used this sample for my post). It is returning the non-null values for the first event and the null values for the last event. The desired output should return the first record for both the first and last events, as null is only valid if it is the first record. I can't say what the output is using the sample data originally provided in the post.
Please provide some accurate sample data and expected results.
Please see my edited post. Both examples are use-cases so I am not sure if that affects the code.
