
I need to add a column to a Spark DataFrame that contains a repeated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000]. I know that we can use monotonically_increasing_id to get a sequence number as a new column.

val df_new = df.withColumn("id", monotonically_increasing_id())

Then, what is the solution to extend this to get the repeated sequence number? Thanks!


1 Answer


You can calculate a row number, subtract 1, divide by 3, cast to integer type, and add 1:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val df_new = df.withColumn(
    "id",
    ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)
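
For example, on a hypothetical nine-row DataFrame (the df and column name below are only illustrative, and a SparkSession named spark is assumed), the new id column comes out as 1, 1, 1, 2, 2, 2, 3, 3, 3:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Nine rows of sample data.
val df = (1 to 9).toDF("value")

val df_new = df.withColumn(
    "id",
    ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)

df_new.show()
// value = 1..9, id = 1, 1, 1, 2, 2, 2, 3, 3, 3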

2 Comments

Nice solution, but I think monotonically_increasing_id does not always start from 0.
It doesn't matter; the row number always starts from 1.
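
To illustrate the second comment (a minimal sketch, assuming a SparkSession named spark): the values from monotonically_increasing_id may be large and non-consecutive, but they are only used to define an ordering, and row_number over that ordering always yields 1, 2, 3, ...:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val withIds = (1 to 6).toDF("value")
  .withColumn("mono_id", monotonically_increasing_id())             // may not start at 0
  .withColumn("rn", row_number().over(Window.orderBy($"mono_id")))  // always 1..6

withIds.show()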
