
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is to compute, for each row, the maximum value across those m columns.

For example:

Input: a PySpark DataFrame containing:

col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]

Output:

col_4 = max(col_1, col_2, col_3) = [3,2,5]

There is something similar in pandas as explained in this question.

Is there any way of doing this in PySpark, or should I convert my PySpark DataFrame to a pandas DataFrame and then perform the operations?

1 Comment
If the question is about getting the max value of each column, then it looks like the expected output should be [max(col_1), max(col_2), max(col_3)] = [3, 4, 5]
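
For reference, those column-wise maxima take one max aggregate per column; a minimal sketch, assuming a SparkSession named spark:

from pyspark.sql.functions import max as max_

df = spark.createDataFrame([(1, 2, 3), (2, 1, 2), (3, 4, 5)],
                           ['col_1', 'col_2', 'col_3'])

# One aggregate per column: [max(col_1), max(col_2), max(col_3)] = [3, 4, 5]
df.agg(*[max_(c) for c in df.columns]).show()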

5 Answers


You can reduce over a list of SQL column expressions:

from pyspark.sql.functions import col, when
from functools import reduce

def row_max(*cols):
    # Fold a when/otherwise chain over the columns: at each step
    # keep the larger of the running maximum and the next column.
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
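
For the sample df above, adding .show() should print:

+---+
|max|
+---+
|  3|
|  2|
|  5|
+---+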

Spark 1.5+ also provides least and greatest:

from pyspark.sql.functions import greatest

df.select(greatest("a", "b", "c"))
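
Note that greatest skips NULLs and returns NULL only when all inputs are NULL, while the when/otherwise chain in row_max can be clobbered by a NULL in a later column (row_max("a", "b", "c") returns NULL when c is NULL). A minimal sketch to illustrate, assuming a SparkSession named spark and Spark 2.3+ for the DDL schema string:

df_null = spark.createDataFrame([(1, 2, None)], "a INT, b INT, c INT")

# greatest ignores the NULL in c and returns 2;
# row_max("a", "b", "c") would return NULL for this row.
df_null.select(greatest("a", "b", "c").alias("g")).show()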

If you want to keep the name of the max column you can use structs:

from pyspark.sql.functions import struct, lit

def row_max_with_name(*cols):
    # Pair each value with its column name; greatest() compares structs
    # field-by-field, so the winning value carries its name along.
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))

And finally you can use the above to find and select the most frequent "top" column:

from pyspark.sql.functions import max

# Count how often each column provides the row max, then take the
# name with the highest count (struct comparison breaks ties by name).
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())

df.select(c)

1 Comment

This is very helpful! How do you find the second largest instead? I want to get the name of the second largest column.
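
One possible approach (a sketch, not part of the answer above): collect the (value, name) structs into an array, sort it in descending order, and take the second element. The helper nth_largest_name is hypothetical:

from pyspark.sql.functions import array, struct, sort_array, col, lit

def nth_largest_name(n, *cols):
    # Hypothetical helper: structs compare field-by-field, so sorting
    # (value, name) pairs in descending order orders them by value first.
    structs = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return sort_array(array(*structs), asc=False).getItem(n - 1).getField("col")

df.select(nth_largest_name(2, "a", "b", "c").alias("second_largest")).show()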

We can use greatest:

Creating DataFrame

df = spark.createDataFrame(
    [[1,2,3], [2,1,2], [3,4,5]], 
    ['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    1|    2|    3|
|    2|    1|    2|
|    3|    4|    5|
+-----+-----+-----+

Solution

from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))

# The same with explicit col() expressions, if you prefer:
# from pyspark.sql.functions import col
# df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()

+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          3|
|    2|    1|    2|          2|
|    3|    4|    5|          5|
+-----+-----+-----+-----------+
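
If the columns to compare are only known at runtime, the same call can be built from a list; a minimal sketch (greatest needs at least two columns):

from pyspark.sql.functions import greatest

cols = ['col_1', 'col_2', 'col_3']  # any subset of df.columns with len >= 2
df3 = df.withColumn('max_by_rows', greatest(*cols))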

1 Comment

If you need the minimum from two columns, use least()

You can also use the PySpark built-in least:

from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))

2 Comments

I think OP wants the opposite of this. Is there an equivalent most function?
Ah it's greatest - see @ansev answer below

Another simple way of doing it. Let us say that the df below is your DataFrame:

df = sc.parallelize([(10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()

+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+

You can process the above df as below to get the desired result:

from pyspark.sql.functions import lit, min

# Build one row holding each column's name and its column-wise minimum,
# then flatten that single row into (name, min) pairs.
df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3')
         )\
  .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
  .toDF(['Column', 'Min']).show()

+------+---+
|Column|Min|
+------+---+
|    c1|  3|
|    c2|  2|
|    c3|  1|
+------+---+
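
If you instead want the row-wise minimum (what the comment below asks for), least gives it directly; a minimal sketch using the same df:

from pyspark.sql.functions import least

df.withColumn('min_by_rows', least('c1', 'c2', 'c3')).show()

+---+---+---+-----------+
| c1| c2| c3|min_by_rows|
+---+---+---+-----------+
| 10| 10|  1|          1|
|200|  2| 20|          2|
|  3| 30|300|          3|
|400| 40|  4|          4|
+---+---+---+-----------+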

1 Comment

You are doing min(col1), whereas I want min(row1), min(row2).. and so on...

Scala solution:

val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")

// Note: values are compared as strings here ("100" < "2"); it happens to
// work for this data, but convert to numeric types for a general min.
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString))
  .map(x => (x(0), x(1), x(2), x.min))
  .toDF("c1", "c2", "c3", "min")
  .show

+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+

