
I have two dataframes like this after some operations on df_old:

df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])

print(df_new_1.count(), df_new_2.count())
>> 10 15

I can find the number of rows individually, as above, by calling count(). But how can I do this with a single PySpark SQL aggregation over the rows, so that both counts come back in one row? I want the result to look like this:

Row(check1=10, check2=15)
  • Show us the pyspark code you have written so far. Commented Dec 27, 2019 at 20:44
  • Here it is @J_H, this is what I tried: df = df_new_1.groupBy("col1").agg({"col1": "count"}).collect(), but it is not giving the answer, for example to test the first condition. Commented Dec 27, 2019 at 21:36

1 Answer


Since you tagged pyspark-sql, you can do the following:

df_old.createOrReplaceTempView("df_table")

spark.sql("""

    SELECT sum(int(col1 >= col2)) as check1
    ,      sum(int(col1 < col2)) as check2
    FROM df_table

""").collect()

Or use the API functions:

from pyspark.sql.functions import expr

df_old.agg(
    expr("sum(int(col1 >= col2)) as check1"), 
    expr("sum(int(col1 < col2)) as check2")
).collect()
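
If you'd rather avoid SQL string expressions altogether, the same aggregation can be written with when/sum column functions (a minimal sketch along the same lines, using pyspark.sql.functions):

from pyspark.sql import functions as F

# Each when(...) yields 1 where the condition holds and 0 otherwise,
# so sum(...) counts the matching rows in a single pass over df_old.
df_old.agg(
    F.sum(F.when(F.col("col1") >= F.col("col2"), 1).otherwise(0)).alias("check1"),
    F.sum(F.when(F.col("col1") < F.col("col2"), 1).otherwise(0)).alias("check2")
).collect()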