PySpark count values by condition

Question

I have a DataFrame, a snippet here:

[['u1', 1], ['u2', 0]]

basically a string field named f and either a 1 or a 0 for second element (is_fav).

What I need to do is grouping on the first field and counting the occurrences of 1s and 0s. I was hoping to do something like

num_fav = count((col("is_fav") == 1)).alias("num_fav")

num_nonfav = count((col("is_fav") == 0)).alias("num_nonfav")

df.groupBy("f").agg(num_fav, num_nonfav)

It does not work properly, I get in both cases the same result which amounts to the count for the items in the group, so the filter (whether it is a 1 or a 0) seems to be ignored. Does this depend on how count works?

zero323 · Accepted Answer · 2016-03-18 01:04:00Z

16

There is no filter here. Both col("is_fav") == 1 and col("is_fav") == 0) are just boolean expressions and count doesn't really care about their value as long as it is defined.

There are many ways you can solve this for example by using simple sum:

from pyspark.sql.functions import sum, abs

gpd = df.groupBy("f")
gpd.agg(
    sum("is_fav").alias("fv"),
    (count("is_fav") - sum("is_fav")).alias("nfv")
)

or making ignored values undefined (a.k.a NULL):

exprs = [
    count(when(col("is_fav") == x, True)).alias(c)
    for (x, c) in [(1, "fv"), (0, "nfv")]
]
gpd.agg(*exprs)

edited Mar 18, 2016 at 1:04

answered Mar 18, 2016 at 0:41

zero323

331k108 gold badges982 silver badges958 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

martina.physics Over a year ago

I get a TypeError: unsupported operand type(s) for +: 'int' and 'str' on the first solution, in sum. I'm sure the column 'is_fav' contains an IntegerType so I don't understand?

zero323 Over a year ago

Missing imports or some conflicts with sum function? I mean you use built-in sum not functions.sum.

martina.physics Over a year ago

Yes that was it, missing import. I guess it was confusing with the built-in Python "sum".

Collectives™ on Stack Overflow

PySpark count values by condition

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related