
I am converting some code from pandas to PySpark. In pandas, let's imagine I have the following mock dataframe, df:

(screenshot of the mock dataframe, which includes "Age" and "Siblings" columns)

And in pandas, I define a certain variable the following way:

value = df.groupby(["Age", "Siblings"]).size()

And the output is a series as follows:

(screenshot of the resulting Series: one row per (Age, Siblings) combination, with the group size as the value)
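
For reference, a minimal reproducible sketch with made-up data (the real dataframe is only shown in the screenshot above):

import pandas as pd

# Made-up rows standing in for the screenshot.
df = pd.DataFrame({
    "Age": [22, 22, 35, 35, 35],
    "Siblings": [0, 0, 1, 2, 2],
})

# Number of rows per (Age, Siblings) combination, returned as a Series.
value = df.groupby(["Age", "Siblings"]).size()
print(value)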

However, when trying to convert this to PySpark, an error comes up: AttributeError: 'GroupedData' object has no attribute 'size'. Can anyone help me solve this?

2 Answers


The equivalent of size in pyspark is count:

df.groupby(["Age", "Siblings"]).count()

1 Comment

count returns the number of rows in each group: spark.apache.org/docs/latest/api/python/…

You can also use the agg method, which is more flexible as it allows you to set a column alias or add other types of aggregations:

import pyspark.sql.functions as F

df.groupby('Age', 'Siblings').agg(F.count('*').alias('count'))
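
For example, a sketch of combining several aggregations in one pass (SomeNumericCol is a hypothetical column name, not from the original data):

import pyspark.sql.functions as F

# "SomeNumericCol" is hypothetical; replace it with a real column from df.
(
    df.groupby('Age', 'Siblings')
      .agg(
          F.count('*').alias('count'),
          F.avg('SomeNumericCol').alias('avg_value'),
      )
      .show()
)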

