I have the following DataFrame in PySpark:

Id Sub
1 Mat
1 Phy
1 Sci
2 Bio
2 Phy
2 Sci

I want to create a DataFrame similar to the one below:

Id Sub hasMath
1 Mat 1
1 Phy 1
1 Sci 1
2 Bio 0
2 Phy 0
2 Sci 0

How do I do this in PySpark? Here is what I tried:

def hasMath(studentID,df):
    return df.filter(col('Id') == studentID & col('sub') = 'Mat' ).count()

df = df.withColumn("hasMath",hasMath(F.col('id'),df1))

But this doesn't seem to work. Is there a better way to achieve this?

1 Answer

You can use collect_list over a window, together with the higher-order filter function via expr (Spark 2.4+), to collect the list of subjects for each Id and keep only the entries equal to 'Mat'.

The size function then returns the length of that filtered array, which is the count you want.

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

w = Window().partitionBy("Id")

df.withColumn("list", F.collect_list(col("Sub")).over(w))\
  .withColumn("hasMath", F.size(F.expr("filter(list, x -> x == 'Mat')")))\
  .drop("list").show()

Output:

+---+---+-------+
| Id|Sub|hasMath|
+---+---+-------+
|  1|Phy|      1|
|  1|Mat|      1|
|  1|Sci|      1|
|  2|Phy|      0|
|  2|Bio|      0|
|  2|Sci|      0|
+---+---+-------+