
Let's say I have the following simple Spark DataFrame.

threshold = ?

ID        percentage
B101        0.3
B101        0.3
B202        0.18
B303        0.25

As you can see above, I need to set the threshold value based on the ID column. For example, if ID == B101, the threshold becomes threshold = 0.3; if ID == B202, the threshold takes a new value and becomes threshold = 0.18. The same logic applies to the rest. I have thousands of values like this, and I would like to do this in a simple way.

I tried this:

threshold  = df.first()['ID']  

But I think there should be a loop to go over all the values.

Can anyone help with this in PySpark?

  • So in the end, are you going to have multiple thresholds for each ID? Commented Oct 29, 2021 at 0:50
  • Actually, I want to update/overwrite the threshold value whenever a particular ID appears in the data frame. So, for each particular ID, the threshold has only one value. Commented Oct 29, 2021 at 7:22
  • I'm still not following. Please show some pseudocode or a function that describes your expectation better. I think the author of the answer below is also confused about what you're asking. Commented Oct 29, 2021 at 17:17

1 Answer

Your DF:

df = spark.createDataFrame(
    [
        ('B101', '0.3'),
        ('B202', '0.18'), 
        ('B303', '0.25')
    ],
    ['ID', 'Percentage']
)

+----+----------+
|  ID|Percentage|
+----+----------+
|B101|       0.3|
|B202|      0.18|
|B303|      0.25|
+----+----------+

A function that returns the Percentage for a given ID:

import pyspark.sql.functions as F

def threshold(ID):
  # Return the Percentage (column index 1) of the first row whose ID matches
  return df.filter(F.col('ID') == F.lit(ID)).collect()[0][1]

Calling the function:

threshold('B303')

Out: '0.25'
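
Since each call to the function above runs a Spark job, one possible alternative for repeated lookups is to collect the (ID, Percentage) pairs once into a plain dictionary. The following is only a minimal sketch: `build_threshold_map` and `threshold_by_id` are hypothetical names, and plain dicts stand in for the Row objects that `df.collect()` would return so that the snippet runs without a Spark session. It also assumes the lookup table is small enough to fit on the driver.

```python
def build_threshold_map(rows):
    """Map each ID to its percentage as a float.

    `rows` is any iterable of objects supporting row["ID"] and
    row["Percentage"], such as the list returned by df.collect().
    If an ID appears more than once, the last occurrence wins.
    """
    return {row["ID"]: float(row["Percentage"]) for row in rows}


# With Spark this would be: rows = df.collect()
# Plain dicts stand in for Row objects here (an assumption for illustration).
rows = [
    {"ID": "B101", "Percentage": "0.3"},
    {"ID": "B202", "Percentage": "0.18"},
    {"ID": "B303", "Percentage": "0.25"},
]

threshold_by_id = build_threshold_map(rows)

threshold = threshold_by_id.get("B303")  # percentage for a known ID
missing = threshold_by_id.get("B999")    # None for an ID not in the table
```

Using `.get` instead of indexing means an ID that is not in the table yields `None` rather than raising a `KeyError`, which may be convenient when new IDs can show up.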

6 Comments

Thanks for the answer. Would it be possible to assign the corresponding percentage to the threshold automatically whenever a value from ID is used in the operation? I mean without calling a function for a specific ID, since I have thousands of IDs.
Do you have a list of IDs, and do you want the percentage associated with each one?
Yes, I have, but the list of IDs is not fixed. New IDs might come in at some point. That is why I didn't consider hard-coding a "for this ID, use this percentage" rule. I would like to pair them up like (ID, Percentage) so that the calculation for a particular ID uses that percentage.
Have you thought of creating a dataframe from those IDs and joining it with the dataframe containing the IDs and Percentages?
Not really, and I didn't quite get the point either. Do you mean creating a data frame for each ID?