
I am converting some code from pandas to PySpark. In pandas, let's imagine I have the following mock dataframe, df:

(screenshot of the mock dataframe, which includes "Age" and "Siblings" columns)

And in pandas, I define a certain variable the following way:

value = df.groupby(["Age", "Siblings"]).size()

And the output is a series as follows:

(screenshot of the resulting Series: one row per (Age, Siblings) combination, with the group size as the value)
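
For reference, a minimal reproducible sketch with made-up data (the real dataframe is only shown in the screenshot above):

import pandas as pd

# Made-up rows standing in for the screenshot.
df = pd.DataFrame({
    "Age": [22, 22, 35, 35, 35],
    "Siblings": [0, 0, 1, 2, 2],
})

# Number of rows per (Age, Siblings) combination, returned as a Series.
value = df.groupby(["Age", "Siblings"]).size()
print(value)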

However, when trying to convert this to PySpark, an error comes up: AttributeError: 'GroupedData' object has no attribute 'size'. Can anyone help me solve this?

2 Answers


The equivalent of size in pyspark is count:

df.groupby(["Age", "Siblings"]).count()

1 Comment

count returns the number of rows in each group: spark.apache.org/docs/latest/api/python/…

You can also use the agg method, which is more flexible as it allows you to set a column alias or add other types of aggregations:

import pyspark.sql.functions as F

df.groupby('Age', 'Siblings').agg(F.count('*').alias('count'))
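
For example, a sketch of combining several aggregations in one pass (SomeNumericCol is a hypothetical column name, not from the original data):

import pyspark.sql.functions as F

# "SomeNumericCol" is hypothetical; replace it with a real column from df.
(
    df.groupby('Age', 'Siblings')
      .agg(
          F.count('*').alias('count'),
          F.avg('SomeNumericCol').alias('avg_value'),
      )
      .show()
)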

