Pyspark: Retrieve row value based on another row value

Question

Let's say I have the following simple Spark DataFrame.

threshold = ?

ID        percentage
B101        0.3
B101        0.3
B202        0.18
B303        0.25

As you can see above, I have to get the threshold value based on the ID column. For example, if ID == B101, the threshold becomes threshold = 0.3. If ID == B202, the threshold gets a new value and becomes threshold = 0.18. The same logic applies to the rest. I have thousands of values like this, and I would like to do this in a simple way.

I tried this:

threshold  = df.first()['ID']

But I think there should be a loop to go over all the values.

Can anyone help with this in Pyspark?

So in the end, are you going to have multiple thresholds for each ID?
– pltc, Commented Oct 29, 2021 at 0:50
Actually, I want to update/overwrite the threshold value whenever a particular ID appears in the data frame. So for each particular ID, the threshold has only one value.
– Hiwot, Commented Oct 29, 2021 at 7:22
I'm still not following. Please show some pseudocode or a function that describes your expectation better. I think the author of the answer below is also confused about what you're asking.
– pltc, Commented Oct 29, 2021 at 17:17

Luiz Viola · Accepted Answer · 2021-10-28 20:58:51Z


Your DF:

df = spark.createDataFrame(
    [
        ('B101', '0.3'),
        ('B202', '0.18'), 
        ('B303', '0.25')
    ],
    ['ID', 'Percentage']
)

+----+----------+
|  ID|Percentage|
+----+----------+
|B101|       0.3|
|B202|      0.18|
|B303|      0.25|
+----+----------+

A function that returns the Percentage for a given ID:

import pyspark.sql.functions as F

def threshold(ID):
    # Return the Percentage of the first row whose ID matches.
    return df.filter(F.col('ID') == F.lit(ID)).collect()[0][1]

Calling the function:

threshold('B303')

Out: '0.25'


6 Comments

Hiwot Over a year ago

Thanks for the answer. Would it be possible to assign the corresponding percentage to the threshold automatically whenever a value from ID is used in the operation? I mean without calling a function for a specific ID, since I have thousands of IDs.

Luiz Viola Over a year ago

Do you have a list of IDs, and you want the percentage associated with each one?

Hiwot Over a year ago

Yes, I have, but the list of IDs is not fixed. New IDs might come at some point. That is why I didn't consider hard-coding "for this ID, use this percentage" logic. I would like to zip them like (ID, Percentage) so that the calculation for a particular ID uses that percentage.

Luiz Viola Over a year ago

Have you thought of creating a DataFrame from those IDs and joining it with the DataFrame that has the IDs and Percentages?

Hiwot Over a year ago

Not really, and I didn't quite get the point. Do you mean creating a data frame for each ID?
