
I have two dataframes like this after some operations on df_old:

df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])

print(df_new_1.count(), df_new_2.count())
>> 10 15

I can find the number of rows individually, as above, by calling count(). But how can I do this with a single PySpark SQL aggregation over the rows, so that both counts come back in one row? I want the result to look like this:

Row(check1=10, check2=15)
  • Show us the pyspark code you have written so far. Commented Dec 27, 2019 at 20:44
  • Here it is @J_H, this is what I tried: df = df_new_1.groupBy("col1").agg({"col1": "count"}).collect(), but it is not giving the answer, for example to test the first condition. Commented Dec 27, 2019 at 21:36

1 Answer


Since you tagged pyspark-sql, you can do the following:

df_old.createOrReplaceTempView("df_table")

spark.sql("""

    SELECT sum(int(col1 >= col2)) as check1
    ,      sum(int(col1 < col2)) as check2
    FROM df_table

""").collect()

Or use the API functions:

from pyspark.sql.functions import expr

df_old.agg(
    expr("sum(int(col1 >= col2)) as check1"), 
    expr("sum(int(col1 < col2)) as check2")
).collect()
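
If you'd rather avoid SQL string expressions altogether, the same aggregation can be written with when/sum column functions (a minimal sketch along the same lines, using pyspark.sql.functions):

from pyspark.sql import functions as F

# Each when(...) yields 1 where the condition holds and 0 otherwise,
# so sum(...) counts the matching rows in a single pass over df_old.
df_old.agg(
    F.sum(F.when(F.col("col1") >= F.col("col2"), 1).otherwise(0)).alias("check1"),
    F.sum(F.when(F.col("col1") < F.col("col2"), 1).otherwise(0)).alias("check2")
).collect()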