I have the following DataFrame in PySpark:

Id Sub
1 Mat
1 Phy
1 Sci
2 Bio
2 Phy
2 Sci

I want to create a DataFrame similar to the one below:

Id Sub hasMath
1 Mat 1
1 Phy 1
1 Sci 1
2 Bio 0
2 Phy 0
2 Sci 0

How do I do this in PySpark? Here is what I tried:

def hasMath(studentID,df):
    return df.filter(col('Id') == studentID & col('sub') = 'Mat' ).count()

df = df.withColumn("hasMath",hasMath(F.col('id'),df1))

But this doesn't seem to work. Is there a better way to achieve this?

1 Answer

You can use collect_list over a window, together with the higher-order filter function via expr (Spark 2.4+), to collect the list of subjects for each Id and keep only the entries equal to 'Mat'.

The size function then returns the length of that filtered array, which is the count you want.

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

w = Window().partitionBy("Id")

df.withColumn("list", F.collect_list(col("Sub")).over(w))\
  .withColumn("hasMath", F.size(F.expr("filter(list, x -> x == 'Mat')")))\
  .drop("list").show()

Output:

+---+---+-------+
| Id|Sub|hasMath|
+---+---+-------+
|  1|Phy|      1|
|  1|Mat|      1|
|  1|Sci|      1|
|  2|Phy|      0|
|  2|Bio|      0|
|  2|Sci|      0|
+---+---+-------+