41

I have a DataFrame:

test = spark.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330), ('bn', 2, 220), ('mb', 14520, 331)], ['x', 'y', 'z'])
test.show()
# +---+-----+---+
# |  x|    y|  z|
# +---+-----+---+
# | bn|12452|221|
# | mb|14521|330|
# | bn|    2|220|
# | mb|14520|331|
# +---+-----+---+

I need to count the rows that satisfy certain conditions:

test.groupBy("x").agg(count(col("y") > 12453), count(col("z") > 230)).show()

which gives

+---+------------------+----------------+
|  x|count((y > 12453))|count((z > 230))|
+---+------------------+----------------+
| bn|                 2|               2|
| mb|                 2|               2|
+---+------------------+----------------+

But this just gives the total row count per group, not the count of rows that satisfy each condition.

5 Answers

76

count doesn't sum Trues; it only counts the number of non-null values, and a boolean expression like col("y") > 12453 is non-null for every row here, so every row gets counted. To count the True values, convert the condition to 1/0 with when and then sum:

import pyspark.sql.functions as F

cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
test.groupBy('x').agg(
    cnt_cond(F.col('y') > 12453).alias('y_cnt'), 
    cnt_cond(F.col('z') > 230).alias('z_cnt')
).show()
+---+-----+-----+
|  x|y_cnt|z_cnt|
+---+-----+-----+
| bn|    0|    0|
| mb|    2|    2|
+---+-----+-----+

3 Comments

From the shown table, is there a way I could extract the values into a Python variable? stackoverflow.com/questions/53689509/…
Can I just check my PySpark understanding here: the lambda function here is all in Spark, so this never has to create a user-defined Python function, with the associated slowdowns. Correct? This looks very handy.
@Psidom, could you help me with my conditional count problem? stackoverflow.com/questions/64470031/…
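In response to the first comment, a minimal sketch (my addition, not part of the original answer) of pulling the aggregated counts back into plain Python values with collect(), reusing the cnt_cond helper from above:

import pyspark.sql.functions as F

# Build the same aggregation, then collect the rows to the driver
agg_df = test.groupBy('x').agg(
    cnt_cond(F.col('y') > 12453).alias('y_cnt'),
    cnt_cond(F.col('z') > 230).alias('z_cnt')
)
counts = {row['x']: (row['y_cnt'], row['z_cnt']) for row in agg_df.collect()}
print(counts)  # e.g. {'bn': (0, 0), 'mb': (2, 2)}

As for the second comment: F.when and F.sum compile to native Spark SQL expressions, so no Python UDF is involved.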
38

Based on @Psidom's answer, my answer is as follows:

from pyspark.sql.functions import col,when,count

test.groupBy("x").agg(
    count(when(col("y") > 12453, True)),
    count(when(col("z") > 230, True))
).show()

1 Comment

Note that the True value here is not necessary; any non-null value would achieve the same result, as count() only counts non-null values.
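For example (my own illustration of that point), replacing True with any other non-null literal gives the same counts, because when() without otherwise() returns null when the condition is false and count() skips those nulls:

from pyspark.sql.functions import col, when, count

test.groupBy("x").agg(
    count(when(col("y") > 12453, "hit")).alias("y_cnt"),  # any non-null literal works
    count(when(col("z") > 230, "hit")).alias("z_cnt")
).show()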
4

Spark 3.5+ has count_if in the Python API:

from pyspark.sql import functions as F

test.groupBy('x').agg(
    F.count_if(F.col('y') > 12453).alias('y_cnt'),
    F.count_if(F.col('z') > 230).alias('z_cnt')
).show()
# +---+-----+-----+
# |  x|y_cnt|z_cnt|
# +---+-----+-----+
# | bn|    0|    0|
# | mb|    2|    2|
# +---+-----+-----+

Spark 3.0+ has it too, but expr must be used:

test.groupBy('x').agg(
    F.expr("count_if(y > 12453) y_cnt"),
    F.expr("count_if(z > 230) z_cnt")
).show()
# +---+-----+-----+
# |  x|y_cnt|z_cnt|
# +---+-----+-----+
# | bn|    0|    0|
# | mb|    2|    2|
# +---+-----+-----+


3

The count function skips null values, so you can try this:

import pyspark.sql.functions as F

def count_with_condition(cond):
    return F.count(F.when(cond, True))

There is also a similar function in this repo: kolang
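A short usage sketch of that helper on the example DataFrame (the aliases are my own):

test.groupBy("x").agg(
    count_with_condition(F.col("y") > 12453).alias("y_cnt"),
    count_with_condition(F.col("z") > 230).alias("z_cnt")
).show()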


2

Since Spark 3.0.0 there is count_if(expr); see the Spark function documentation.

1 Comment

I tried count_if(expr) in PySpark 3.1.2, but it is not in pyspark.sql.functions. According to spark.apache.org/docs/3.1.1/sql-ref.html it is a built-in aggregate function for SQL queries, so it would be better to explain and give an example in the answer, because answers that are barely more than a link to an external site may be deleted.
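To make this concrete (my own sketch, since before 3.5 count_if is only available as a SQL built-in aggregate, not in pyspark.sql.functions): register the DataFrame as a temp view and call it from spark.sql:

test.createOrReplaceTempView("test")
spark.sql("""
    SELECT x,
           count_if(y > 12453) AS y_cnt,
           count_if(z > 230)   AS z_cnt
    FROM test
    GROUP BY x
""").show()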
