
I have a Spark DataFrame as follows:

+--+--------+-----------+
|id| account|       time|
+--+--------+-----------+
| 4|      aa| 01/01/2017|
| 2|      bb| 03/01/2017|
| 6|      cc| 04/01/2017|
| 1|      bb| 05/01/2017|
| 5|      bb| 09/01/2017|
| 3|      aa| 02/01/2017|
+--+--------+-----------+
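
For reference, the DataFrame can be created like this (the column types are an assumption: id as an int, account and time as strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; types assumed: int, str, str.
df = spark.createDataFrame(
    [(4, 'aa', '01/01/2017'),
     (2, 'bb', '03/01/2017'),
     (6, 'cc', '04/01/2017'),
     (1, 'bb', '05/01/2017'),
     (5, 'bb', '09/01/2017'),
     (3, 'aa', '02/01/2017')],
    ['id', 'account', 'time'],
)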

I want to get the data as follows:

+---+---+-------+
|id1|id2|account|
+---+---+-------+
|  4|  3|     aa|
|  2|  5|     bb|
|  1|  5|     bb|
|  2|  1|     bb|
+---+---+-------+

So I need to find every possible pair of ids within an account, where id1 is the id with the earlier time and id2 is the id with the later time. I'm very new to PySpark; I think a self join may be a good start.
Can anyone help me with this?


1 Answer


IIUC, you can achieve this using a self join:

import pyspark.sql.functions as f

# Self join df on account, aliasing the two sides 'l' and 'r' so their
# columns can be told apart, then keep only the pairs where the right
# row's time is strictly later than the left row's.
df.alias('l').join(df.alias('r'), on='account')\
    .where('r.time > l.time')\
    .select(f.col('l.id').alias('id1'), f.col('r.id').alias('id2'), 'l.account')\
    .show()
#+---+---+-------+
#|id1|id2|account|
#+---+---+-------+
#|  1|  5|     bb|
#|  2|  1|     bb|
#|  2|  5|     bb|
#|  4|  3|     aa|
#+---+---+-------+
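
One caveat: time here is a string, so r.time > l.time compares lexically. That happens to order this sample correctly because every value shares the same width, month, and year, but for general data you should parse the column to a real date first. A minimal sketch, assuming the strings are MM/dd/yyyy (swap the pattern if your data is dd/MM/yyyy):

import pyspark.sql.functions as f

# Parse the time strings to dates so the comparison is chronological,
# not lexical. The MM/dd/yyyy pattern is an assumption.
df_dated = df.withColumn('time', f.to_date(f.col('time'), 'MM/dd/yyyy'))

df_dated.alias('l').join(df_dated.alias('r'), on='account')\
    .where('r.time > l.time')\
    .select(f.col('l.id').alias('id1'), f.col('r.id').alias('id2'), 'l.account')\
    .show()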
  • Join the DataFrame (df) to itself on account. (We alias the left and right DataFrames as 'l' and 'r' respectively.)
  • Next, filter with where to keep only the rows where r.time > l.time.
  • Everything left is a pair of ids for the same account where the l.id row has the earlier time and the r.id row the later one; an equivalent Spark SQL formulation is sketched below.
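
If you prefer SQL, the same join can be written against a temp view. A minimal sketch, assuming spark is your SparkSession; the view name 'accounts' is just an illustrative choice:

# Register df under an illustrative view name and express the
# self join in plain SQL.
df.createOrReplaceTempView('accounts')
spark.sql("""
    SELECT l.id AS id1, r.id AS id2, l.account
    FROM accounts l
    JOIN accounts r
      ON l.account = r.account
     AND r.time > l.time
""").show()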