
I need to add a column to a Spark DataFrame that contains a repeated sequence number, such as [1, 1, 1, 2, 2, 2, 3, 3, 3, ..., 10000, 10000, 10000]. I know that we can use monotonically_increasing_id to get a sequence number as a new column.

val df_new = df.withColumn("id", monotonically_increasing_id())

Then, what is the solution to extend this to get the repeated sequence number? Thanks!


1 Answer


You can calculate a row number, subtract 1, divide by 3, cast to integer type, and add 1:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val df_new = df.withColumn(
    "id",
    ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)
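
For example, on a hypothetical nine-row DataFrame (the df and column name below are only illustrative, and a SparkSession named spark is assumed), the new id column comes out as 1, 1, 1, 2, 2, 2, 3, 3, 3:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Nine rows of sample data.
val df = (1 to 9).toDF("value")

val df_new = df.withColumn(
    "id",
    ((row_number().over(Window.orderBy(monotonically_increasing_id())) - 1) / 3).cast("int") + 1
)

df_new.show()
// value = 1..9, id = 1, 1, 1, 2, 2, 2, 3, 3, 3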

2 Comments

Nice solution, but I think monotonically_increasing_id does not always start from 0.
It doesn't matter; the row number always starts from 1.
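
To illustrate the second comment (a minimal sketch, assuming a SparkSession named spark): the values from monotonically_increasing_id may be large and non-consecutive, but they are only used to define an ordering, and row_number over that ordering always yields 1, 2, 3, ...:

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val withIds = (1 to 6).toDF("value")
  .withColumn("mono_id", monotonically_increasing_id())             // may not start at 0
  .withColumn("rn", row_number().over(Window.orderBy($"mono_id")))  // always 1..6

withIds.show()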
