
I have a use case where I need to deduplicate a dataframe on a column (a GUID column). But instead of discarding the duplicates, I need to store them in a separate location. For example, given the following data with schema (name, GUID): (a, 1), (b, 2), (a, 2), (a, 3), (c, 1), (c, 4), I want to split the dataset so that (a, 1), (b, 2), (a, 3), (c, 4) end up in one part and (a, 2), (c, 1) in a second part. If I use dropDuplicates(["GUID"]), the second part is lost. What would be an efficient way to do this?
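For reference, the sample data above can be built with something like this (a minimal sketch; it assumes a SparkSession is available as spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# (name, GUID) pairs from the question; GUIDs 1 and 2 each appear twice
df = spark.createDataFrame(
    [('a', 1), ('b', 2), ('a', 2), ('a', 3), ('c', 1), ('c', 4)],
    ['name', 'GUID']
)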

1 Answer


You can assign a row number within each GUID partition, then split the dataframe into two parts based on whether the row number equals 1.

from pyspark.sql import functions as F, Window

# Number the rows within each GUID group; monotonically_increasing_id acts as a
# tiebreaker so the row that appears first in the dataframe gets rn = 1.
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('GUID').orderBy(F.monotonically_increasing_id()))
)
df2.show()
+----+----+---+
|name|GUID| rn|
+----+----+---+
|   a|   1|  1|
|   c|   1|  2|
|   a|   3|  1|
|   b|   2|  1|
|   a|   2|  2|
|   c|   4|  1|
+----+----+---+

df2_part1 = df2.filter('rn = 1').drop('rn')   # deduplicated rows (first occurrence per GUID)
df2_part2 = df2.filter('rn != 1').drop('rn')  # the duplicates that were removed

df2_part1.show()
+----+----+
|name|GUID|
+----+----+
|   a|   1|
|   a|   3|
|   b|   2|
|   c|   4|
+----+----+

df2_part2.show()
+----+----+
|name|GUID|
+----+----+
|   c|   1|
|   a|   2|
+----+----+
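Since df2 is filtered twice, caching it is one way to avoid recomputing the window for each part. A sketch, assuming both parts are written out to storage (the parquet paths are placeholders):

# Cache so the window computation runs only once across the two filters
df2 = df2.cache()

# Deduplicated rows and the removed duplicates, written to hypothetical locations
df2.filter('rn = 1').drop('rn').write.parquet('/tmp/deduplicated')
df2.filter('rn != 1').drop('rn').write.parquet('/tmp/duplicates')

df2.unpersist()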