I have a PySpark dataframe and I would like to split its rows into columns based on the unique values of a given column, joining on the values of the other column. For illustrative purposes, let me use the following example, where my original dataframe is df.

df.show()
+-----+-----+
| col1| col2|
+-----+-----+
|   z1|   a1|
|   z1|   b2|
|   z1|   c3|
|   x1|   a1|
|   x1|   b2|
|   x1|   c3|
+-----+-----+

What I would like to do is to split on the unique values of col1, thus generating a new column (say, col3) by joining on the values of col2. The resulting dataframe that I am after would look like the following:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  x1 |
|   z1|   b2|  x1 |
|   z1|   c3|  x1 |
+-----+-----+-----+

This illustrative example only contains two unique values in col1 (i.e. z1 and x1). Ideally, I would like to write a piece of code which automatically detects the unique values in col1 and generates a corresponding new column. Does anyone know where I can start?

Edit: It is arbitrary that z1 and x1 end up in col1 and col3, respectively. It could definitely be the other way round, since I am simply interested in splitting by unique values.

Many thanks in advance,

Marioanzas

2 Answers

I think you're trying to group by col2 and collect a set of distinct col1 values.

import pyspark.sql.functions as F

# Collect the distinct col1 values for each col2 into an array.
df2 = df.groupBy('col2').agg(F.collect_set('col1').alias('col'))

# Expand the array into one column per element; the upper bound is the
# largest array size across all rows.
df3 = df2.select(
    'col2',
    *[F.col('col')[i] for i in range(df2.select(F.max(F.size('col'))).head()[0])]
)

df3.show()
+----+------+------+
|col2|col[0]|col[1]|
+----+------+------+
|  b2|    z1|    x1|
|  a1|    z1|    x1|
|  c3|    z1|    x1|
+----+------+------+

Then you can rearrange/rename the columns as you wish.

It's not obvious to me why z1 should go in col1 and x1 in col3 in your question. It could very well be the other way round - there is no way to tell from your logic.

1 Comment

Hi @mck, many thanks for your suggestion! I will try it shortly. Regarding the last part of your answer, it is arbitrary that z1 and x1 go in col1 and col3, respectively. As you mention, it could definitely be the other way round! I will update the wording of the question to make this clearer. Thanks very much!

You can join using two conditions and dropDuplicates:

import pyspark.sql.functions as F

df.alias('left').join(
    df.alias('right'),
    on=[
        F.col('left.col2') == F.col('right.col2'),
        F.col('left.col1') != F.col('right.col1')
    ]
).dropDuplicates(['col2']).show()

Output:

+----+----+----+----+
|col1|col2|col1|col2|
+----+----+----+----+
|  x1|  b2|  z1|  b2|
|  x1|  a1|  z1|  a1|
|  x1|  c3|  z1|  c3|
+----+----+----+----+

Then you can drop and rename columns:

df.alias('left').join(
    df.alias('right'),
    on = [
          F.col('left.col2') == F.col('right.col2'),
          F.col('left.col1') != F.col('right.col1')
          ]
).dropDuplicates(['col2'])\
.drop(F.col('right.col2'))\
.select(
    F.col('right.col1'), F.col('col2'), F.col('left.col1').alias('col3')
).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+

You can also use SQL:

df.createOrReplaceTempView('df')

spark.sql(
    """
    SELECT
      l.col1 AS col1,
      l.col2 AS col2,
      r.col1 AS col3 
    FROM df AS l
    INNER JOIN df AS r
    ON l.col2 = r.col2
    AND l.col1 <> r.col1
    """
).dropDuplicates(['col2']).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  z1|  b2|  x1|
|  z1|  a1|  x1|
|  z1|  c3|  x1|
+----+----+----+
