
I have the script below (I've removed all the column names etc. to make it easier to see what I am doing at a high level; it was very messy!).

I need to add a column that is the equivalent of count(*) in SQL.

So if I have grouped user usage by domain, I might see the output below, where the count is the number of records that match all the prior column conditions:

domain.co.uk/        UK User    32433
domain.co.uk/home    EU User    43464
etc...

I'm sure it's been asked somewhere on Stack Overflow before, but I've had a good look around and can't find any reference to it!

vpx_cont_filter = vpx_data\
        .coalesce(1000)\
        .join(....)\
        .select(....)\
        .groupBy(....)\
        .agg(
           ....
            )\
        .select(....)

1 Answer


Do you mean that in your agg, you want to add a column that counts all occurrences for each groupBy?

You can then add this:

.agg(
  F.count(F.lit(1)).alias("total_count"),
  ...
)

By the way, I don't think you're forced to use F.lit(1). In the Spark source code, they have a match case for when you specify the star instead of F.lit(1):

// Turn count(*) into count(1)
  case s: Star => Count(Literal(1))
  case _ => Count(e.expr)

So F.count("*") would also work, I think.

PS: I'm using F. because I assumed you imported the functions package like this:

from pyspark.sql import functions as F
