
As the title states, I'd like to create a normalized version of an existing Double column.

As I'm quite new to pyspark, this was my attempt at solving this:

from pyspark.sql import functions as F

df2 = df.groupBy('id').count().toDF(*['id', 'count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')))

When I do this, I get the following error:

"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.

Any help would be much appreciated.

1 Answer

You need to compute the maximum of count_trans over an empty window: without a window specification (or a groupBy), Spark treats max as an aggregate function, which is what triggers the error you saw:

df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
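
With an empty OVER () clause, max runs as a window function over the entire result set instead of as an aggregate that needs grouping expressions, so every row is divided by the global maximum.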

Or if you prefer pyspark syntax:

from pyspark.sql import functions as F, Window

df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))
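
For completeness, here is a minimal end-to-end sketch with hypothetical toy data (the ids and counts below are made up for illustration); Window.partitionBy() with no columns is another common way to write an empty window:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: one row per transaction, keyed by id.
df = spark.createDataFrame([(1,), (1,), (1,), (2,), (2,), (3,)], ['id'])

df2 = df.groupBy('id').count().toDF(*['id', 'count_trans'])

# An empty window spans the whole DataFrame, so max is the global maximum.
w = Window.orderBy()
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max('count_trans').over(w))

df3.show()
# id 1 -> 3/3 = 1.0, id 2 -> 2/3, id 3 -> 1/3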