Duplicate rows in a Pyspark Dataframe

Question

Let's say I have a dataframe:

df = sqlContext.createDataFrame(
    [(1, 10, 23.0, 5), (3, 14, -23.0, 0)], ("x1", "x2", "x3", "x4"))

df.show()

## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1| 10| 23.0|  5|
## |  3| 14|-23.0|  0|
## +---+---+-----+---+

What would be an efficient way to "duplicate" each row, setting x4=1 in the duplicates, so that the result is:

## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1| 10| 23.0|  5|
## |  1| 10| 23.0|  1|
## |  3| 14|-23.0|  0|
## |  3| 14|-23.0|  1|
## +---+---+-----+---+

In Apache Pig the analog would be simple: do a FOREACH and GENERATE:

FLATTEN(TOBAG(1, x4)) AS x4

Thank you all

zero323 · Accepted Answer · 2016-10-20 14:54:51Z

Import required functions from pyspark.sql.functions:

from pyspark.sql.functions import array, explode, lit

and replace the existing column:

df.withColumn("x4", explode(array(lit(1), df["x4"])))
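
To see why this doubles the row count, here is a minimal plain-Python sketch of what the expression does to each row. It is only a model, not Spark code: the helper name explode_with_constant and the dict-based rows are assumptions made for illustration.

```python
# Plain-Python model of df.withColumn("x4", explode(array(lit(1), df["x4"]))).
# The helper name and the dict-based row layout are illustrative, not Spark API.

def explode_with_constant(rows, column, constant):
    """Emit two rows per input row: one with `constant`, one with the original value."""
    out = []
    for row in rows:
        # array(lit(constant), row[column]) builds a two-element array per row
        for value in (constant, row[column]):
            # explode emits one output row per array element, copying the other columns
            out.append({**row, column: value})
    return out


rows = [
    {"x1": 1, "x2": 10, "x3": 23.0, "x4": 5},
    {"x1": 3, "x2": 14, "x3": -23.0, "x4": 0},
]

result = explode_with_constant(rows, "x4", 1)
for row in result:
    print(row)
```

In this model the copy with x4=1 comes before the original within each pair, because the constant is listed first in the array. Swapping the arguments, as in array(df["x4"], lit(1)), should put the original value first.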

1 Comment

Mubin Over a year ago

+1, this adds one copy of each row to df, but what if I want to insert n copies, where n is the value of another column in df?