
I have a Spark DataFrame that looks like this:

|  time  | col1 | col2 |
|--------|------|------|
| 123456 |   2  |  A   |
| 123457 |   4  |  B   |
| 123458 |   7  |  C   |
| 123459 |   5  |  D   |
| 123460 |   3  |  E   |
| 123461 |   1  |  F   |
| 123462 |   9  |  G   |
| 123463 |   8  |  H   |
| 123464 |   6  |  I   |

Now I need to sort the "col1" column, while the other columns remain in their original order (using PySpark):

|  time  | col1 | col2 | col1_sorted |
|--------|------|------|-------------|
|  same  | same | same |   sorted    |
|--------|------|------|-------------|
| 123456 |   2  |  A   |     1      |
| 123457 |   4  |  B   |     2      |
| 123458 |   7  |  C   |     3      |
| 123459 |   5  |  D   |     4      |
| 123460 |   3  |  E   |     5      |
| 123461 |   1  |  F   |     6      |
| 123462 |   9  |  G   |     7      |
| 123463 |   8  |  H   |     8      |
| 123464 |   6  |  I   |     9      |

Thanks in advance for any help!

  • actually this is already a partition Commented Sep 8, 2020 at 7:47
  • I use spark 2.3.1, is there a solution for spark 2.4.x? Commented Sep 9, 2020 at 7:39

2 Answers


For Spark 2.3.1, you can try a pandas_udf; see below (this assumes the original dataframe is already sorted by the time column):

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType

# output schema = the input schema plus the new col1_sorted column
schema = StructType.fromJson(df.schema.jsonValue()).add('col1_sorted', 'integer')

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def get_col1_sorted(pdf):
    # keep the rows in time order, then attach the sorted col1 values as a new column
    return pdf.sort_values(['time']).assign(col1_sorted=sorted(pdf["col1"]))

df.groupby().apply(get_col1_sorted).show()
+------+----+----+-----------+
|  time|col1|col2|col1_sorted|
+------+----+----+-----------+
|123456|   2|   A|          1|
|123457|   4|   B|          2|
|123458|   7|   C|          3|
|123459|   5|   D|          4|
|123460|   3|   E|          5|
|123461|   1|   F|          6|
|123462|   9|   G|          7|
|123463|   8|   H|          8|
|123464|   6|   I|          9|
+------+----+----+-----------+

2 Comments

I assume this function converts the Spark dataframe into a pandas dataframe?
It does not convert the whole dataframe into a pandas df, but it does use the pandas engine in a distributed way; see spark.apache.org/docs/2.3.1/…
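
To make the pandas side of this concrete, here is a small standalone illustration of what the UDF body does within each group, using made-up toy data rather than the thread's dataframe (a sketch, not part of the original answer):

import pandas as pd

# toy pandas frame standing in for one group handed to the grouped-map UDF
pdf = pd.DataFrame({"time": [123456, 123457, 123458],
                    "col1": [2, 4, 1],
                    "col2": ["A", "B", "C"]})

# same logic as the UDF body: keep time order, attach the sorted col1 values
out = pdf.sort_values(["time"]).assign(col1_sorted=sorted(pdf["col1"]))
print(out)
#      time  col1 col2  col1_sorted
# 0  123456     2    A            1
# 1  123457     4    B            2
# 2  123458     1    C            4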

My own solution is the following:

First, make a copy of df with only col1 selected, ordered by col1:

df_copy = df.select("col1").orderBy("col1")

Second, index both dataframes (the same code is applied to df_copy, just with the window ordered by "col1"):

import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.window import Window

# running window over all preceding rows, ordered by time
w = Window.orderBy("time").rowsBetween(-sys.maxsize, 0)

# the running sum of the helper column yields a 1-based row index
df = df\
            .withColumn("helper", lit(1))\
            .withColumn("index", lit(0))\
            .withColumn("index", F.col("index")+F.sum(F.col("helper")).over(w))

Last step: rename col1 to col1_sorted and join the dataframes:

df_copy = df_copy.withColumnRenamed("col1", "col1_sorted")
    
df = df.join(df_copy, df.index == df_copy.index, how="inner")
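
As a rough sketch of the same idea in a more compact form (my condensation, not the answerer's code): the running sum can be replaced by F.row_number() over each window, and the index used directly as the join key:

import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# condensed variant of the index-and-join approach above
w_time = Window.orderBy("time")
w_col1 = Window.orderBy("col1")

left  = df.withColumn("index", F.row_number().over(w_time))
right = (df.select("col1")
           .withColumn("index", F.row_number().over(w_col1))
           .withColumnRenamed("col1", "col1_sorted"))

result = left.join(right, on="index", how="inner").drop("index")
result.orderBy("time").show()

As with the running-sum version, Spark will warn that a window without a partition clause moves all data into a single partition.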

