
I'm having trouble printing out this query with the months sorted in the proper order.

Is there a PySpark function to sort the month column in descending order (without using SQL commands)?

from pyspark import SparkContext
from pyspark.sql import SQLContext
from operator import add

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

rows = sc.textFile("data.csv")
data = rows.map(lambda line: line.split(","))
header = data.first()

q = data.filter(lambda line: line != header)\
        .map(lambda x:(x[1], 1))\
        .reduceByKey(add)\
        .sortByKey(True)\
        .collect()

sqlContext.createDataFrame(q, ['Month','Total number of operated flights']).show()

+-----+--------------------------------+
|Month|Total number of operated flights|
+-----+--------------------------------+
|    1|                          621559|
|   10|                          629992|
|   11|                          605149|
|   12|                          614139|
|    2|                          565604|
|    3|                          639209|
|    4|                          614648|
|    5|                          631609|
|    6|                          629280|
|    7|                          648560|
|    8|                          653279|
|    9|                          600187|
+-----+--------------------------------+

1 Answer

That's because the month column is treated as a string. There are several ways to achieve the correct order. For instance, you can cast the column to int in the RDD pipeline and use sortByKey(False) for descending order:

q = data.filter(lambda line: line != header)\
    .map(lambda x:(int(x[1]), 1))\
    .reduceByKey(add)\
    .sortByKey(False)\
    .collect()

I'm not quite sure this will match your expectations, but you don't necessarily need to invoke collect() to make a DataFrame from the RDD - you can achieve that by running:

df = data.filter(lambda line: line != header)\
         .map(lambda x: (int(x[1]), 1))\
         .reduceByKey(add)\
         .sortByKey(False)\
         .toDF(['Month', 'Total number of operated flights'])

You can do the casting with the DataFrame API as well, using df.withColumn('Month', df['Month'].cast(pyspark.sql.types.IntegerType())), but you've stated you don't want to use Spark SQL.
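
For completeness, here is a minimal sketch of that DataFrame-only route (withColumn, cast, and orderBy are DataFrame API calls, not SQL statements), assuming q is the string-keyed result from the question's original pipeline:

from pyspark.sql.types import IntegerType

# q is assumed to be the list of (month-string, count) pairs from the question.
df = sqlContext.createDataFrame(q, ['Month', 'Total number of operated flights'])
# Cast the month column to int so ordering is numeric, not lexicographic.
df = df.withColumn('Month', df['Month'].cast(IntegerType()))
df.orderBy(df['Month'].desc()).show()

Here orderBy with desc() replaces the sortByKey(False) step, so the RDD pipeline itself no longer needs the int cast.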
