I have a PySpark dataframe and I would like to split its rows into columns based on the unique values of a given column, joining on the values of the other column. For illustrative purposes, let me use the following example, where my original dataframe is df.

df.show()
+-----+-----+
| col1| col2|
+-----+-----+
|   z1|   a1|
|   z1|   b2|
|   z1|   c3|
|   x1|   a1|
|   x1|   b2|
|   x1|   c3|
+-----+-----+

What I would like to do is to split on the unique values of col1, thus generating a new column (say, col3) by joining on the values of col2. The resulting dataframe that I am after would look like the following:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  x1 |
|   z1|   b2|  x1 |
|   z1|   c3|  x1 |
+-----+-----+-----+

This illustrative example only contains two unique values in col1 (i.e. z1 and x1). Ideally, I would like to write a piece of code which automatically detects the unique values in col1 and generates a corresponding new column. Does anyone know where I can start?

Edit: It is arbitrary that z1 and x1 end up in col1 and col3, respectively. It could definitely be the other way round, since I am simply interested in splitting by unique values.

Many thanks in advance,

Marioanzas

2 Answers

I think you're trying to group by col2 and collect a set of distinct col1 values.

import pyspark.sql.functions as F

# Collect the distinct col1 values for each col2 into an array.
df2 = df.groupBy('col2').agg(F.collect_set('col1').alias('col'))

# Expand the array into one column per element; the upper bound is the
# largest array size across all rows.
df3 = df2.select(
    'col2',
    *[F.col('col')[i] for i in range(df2.select(F.max(F.size('col'))).head()[0])]
)

df3.show()
+----+------+------+
|col2|col[0]|col[1]|
+----+------+------+
|  b2|    z1|    x1|
|  a1|    z1|    x1|
|  c3|    z1|    x1|
+----+------+------+

Then you can rearrange/rename the columns as you wish.

It's not obvious to me why z1 should go in col1 and x1 in col3 in your question. It could very well be the other way round - there is no way to tell from your logic.

1 Comment

Hi @mck, many thanks for your suggestion! I will try it shortly. Regarding the last part of your answer, it is arbitrary that z1 and x1 go in col1 and col3, respectively. As you mention, it could definitely be the other way round! I will update the wording of the question to make this clearer. Thanks very much!

You can join using two conditions and dropDuplicates:

import pyspark.sql.functions as F

df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),
        F.col('left.col1') != F.col('right.col1')
    ]
).dropDuplicates(['col2']).show()

Output:

+----+----+----+----+
|col1|col2|col1|col2|
+----+----+----+----+
|  x1|  b2|  z1|  b2|
|  x1|  a1|  z1|  a1|
|  x1|  c3|  z1|  c3|
+----+----+----+----+

Then you can drop and rename columns:

df.alias('left').join(
    df.alias('right'),
    on = [
          F.col('left.col2') == F.col('right.col2'),
          F.col('left.col1') != F.col('right.col1')
          ]
).dropDuplicates(['col2'])\
.drop(F.col('right.col2'))\
.select(
    F.col('right.col1'), F.col('col2'), F.col('left.col1').alias('col3')
).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+

You can also use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT
      l.col1 AS col1,
      l.col2 AS col2,
      r.col1 AS col3 
    FROM df AS l
    INNER JOIN df AS r
    ON l.col2 = r.col2
    AND l.col1 <> r.col1
    """
).dropDuplicates(['col2']).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+
