
I have a Spark DataFrame as follows:

+--+--------+-----------+
|id| account|       time|
+--+--------+-----------+
| 4|      aa| 01/01/2017|
| 2|      bb| 03/01/2017|
| 6|      cc| 04/01/2017|
| 1|      bb| 05/01/2017|
| 5|      bb| 09/01/2017|
| 3|      aa| 02/01/2017|
+--+--------+-----------+
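
For reference, the DataFrame can be created like this (the column types are an assumption: id as an int, account and time as strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; types assumed: int, str, str.
df = spark.createDataFrame(
    [(4, 'aa', '01/01/2017'),
     (2, 'bb', '03/01/2017'),
     (6, 'cc', '04/01/2017'),
     (1, 'bb', '05/01/2017'),
     (5, 'bb', '09/01/2017'),
     (3, 'aa', '02/01/2017')],
    ['id', 'account', 'time'],
)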

I want to get the data as follows:

+---+---+-------+
|id1|id2|account|
+---+---+-------+
|  4|  3|     aa|
|  2|  5|     bb|
|  1|  5|     bb|
|  2|  1|     bb|
+---+---+-------+

So I need to find every possible pair of ids within an account, where id1 is the id with the earlier time and id2 is the id with the later time. I'm very new to PySpark; I think a self join may be a good start.
Can anyone help me with this?


1 Answer


IIUC, you can achieve this using a self join:

import pyspark.sql.functions as f

# Self join df on account, aliasing the two sides 'l' and 'r' so their
# columns can be told apart, then keep only the pairs where the right
# row's time is strictly later than the left row's.
df.alias('l').join(df.alias('r'), on='account')\
    .where('r.time > l.time')\
    .select(f.col('l.id').alias('id1'), f.col('r.id').alias('id2'), 'l.account')\
    .show()
#+---+---+-------+
#|id1|id2|account|
#+---+---+-------+
#|  1|  5|     bb|
#|  2|  1|     bb|
#|  2|  5|     bb|
#|  4|  3|     aa|
#+---+---+-------+
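
One caveat: time here is a string, so r.time > l.time compares lexically. That happens to order this sample correctly because every value shares the same width, month, and year, but for general data you should parse the column to a real date first. A minimal sketch, assuming the strings are MM/dd/yyyy (swap the pattern if your data is dd/MM/yyyy):

import pyspark.sql.functions as f

# Parse the time strings to dates so the comparison is chronological,
# not lexical. The MM/dd/yyyy pattern is an assumption.
df_dated = df.withColumn('time', f.to_date(f.col('time'), 'MM/dd/yyyy'))

df_dated.alias('l').join(df_dated.alias('r'), on='account')\
    .where('r.time > l.time')\
    .select(f.col('l.id').alias('id1'), f.col('r.id').alias('id2'), 'l.account')\
    .show()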
  • Join the DataFrame (df) to itself on account. (We alias the left and right DataFrames as 'l' and 'r' respectively.)
  • Next, filter with where to keep only the rows where r.time > l.time.
  • Everything left is a pair of ids for the same account where the l.id row has the earlier time and the r.id row the later one; an equivalent Spark SQL formulation is sketched below.
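
If you prefer SQL, the same join can be written against a temp view. A minimal sketch, assuming spark is your SparkSession; the view name 'accounts' is just an illustrative choice:

# Register df under an illustrative view name and express the
# self join in plain SQL.
df.createOrReplaceTempView('accounts')
spark.sql("""
    SELECT l.id AS id1, r.id AS id2, l.account
    FROM accounts l
    JOIN accounts r
      ON l.account = r.account
     AND r.time > l.time
""").show()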