
How can I add a column with sequential values, starting from a specific number, to a PySpark DataFrame?

Current Dataset:

Col1    Col2    Flag
Val1    Val2    F
Val3    Val4    T

But I want the data set to be like this:

Col1    Col2    Flag    New_Col
Val1    Val2    F       11F
Val3    Val4    T       12T

I'm using the code below, in Python:

from pyspark.sql import functions as F
from pyspark.sql import types as T

seq = 10

def fn_increment_id(flag):
    # Intended to bump the global counter once per row
    global seq
    seq += 1
    return str(seq) + flag

if __name__ == "__main__":
    # `spark` is an existing SparkSession (MapR-DB connector)
    df = spark.loadFromMapRDB("path/to/table")
    my_udf = F.UserDefinedFunction(fn_increment_id, T.StringType())
    df = df.withColumn("New_Col", my_udf("Flag"))
    df.show(10)  # show() prints the frame itself; no print() needed

But I end up with this result:

Received Dataset:

Col1    Col2    Flag    New_Col
Val1    Val2    F       11F
Val3    Val4    T       11T

So it incremented only once for all rows. How can I increment it for each row? Thanks in advance.

  • Do you have a column to order the dataframe by? Commented Aug 15, 2018 at 6:15
  • @Shaido No, I don't. In fact, ordering the DF is not required. Commented Aug 15, 2018 at 6:22
  • So it doesn't matter which row gets which sequence value? As long as they are different, it's fine? Commented Aug 15, 2018 at 6:23
  • @Shaido Yes, exactly, it doesn't matter which row gets which sequence value; the values just need to be different. Also, let me know if there is any solution if the rows are ordered (though this is not required in the current project/scenario). Commented Aug 15, 2018 at 6:26

1 Answer


A column with sequential values can be added by using a Window. This is fine as long as the dataframe is not too big: an ordered window without partitionBy moves all rows into a single partition. For larger dataframes you should consider using partitionBy on the window, but the values will not be sequential then. (The UDF approach fails because the global seq counter is serialized out to each executor, and each executor increments its own private copy, so the increments are never shared across rows or reflected back on the driver.)
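For reference, here is a minimal sketch of the partitioned variant mentioned above; the partitioning column (Flag here) and the output column name are just assumptions for illustration. Row numbers restart within each partition, so they are no longer globally sequential:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Hypothetical partitioned variant: numbering restarts for each Flag value
w_part = Window.partitionBy("Flag").orderBy("Col1")
df = df.withColumn("Part_Num", row_number().over(w_part) + 10)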

The code below creates a sequential number for each row, adds 10 to it, and then concatenates the value with the Flag column to create the new column. Here the rows are sorted by Col1, but any column can be used.

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, concat

w = Window.orderBy("Col1")  # no partitionBy: all rows go through a single partition
df = df.withColumn("New_Col", concat((row_number().over(w) + 10).cast("string"), col("Flag")))
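
With the example data from the question, sorting by Col1 gives row numbers 1 and 2, so df.show() produces:

Col1    Col2    Flag    New_Col
Val1    Val2    F       11F
Val3    Val4    T       12T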
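Since the comments establish that the values only need to be distinct rather than strictly sequential, a shuffle-free sketch using monotonically_increasing_id is another option (this swaps in a different technique, not the window-based approach above); the generated IDs are unique per row but generally not consecutive:

from pyspark.sql.functions import monotonically_increasing_id, concat, col

# IDs are unique but not consecutive; the +11 offset is just to mirror the example
df = df.withColumn("New_Col", concat((monotonically_increasing_id() + 11).cast("string"), col("Flag")))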