0

I`m trying to solve exactly this problem: [Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) but without using UDF in Pyspark

I see lot of options in Scala but nothing for Pyspark.

2
  • dense vector or sparse vector? Commented May 13, 2020 at 3:38
  • Hi @Reema, Have you checked my answer? if it is working please upvote + accept. Commented May 19, 2020 at 5:54

2 Answers 2

1

I hope you have predictions dataframe as below-

predictions.show(false)
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+
|label      |features                              |indexedLabel|indexedFeatures                       |rawPrediction                                             |probability                                                  |prediction|predictedLabel |
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+
|Iris-setosa|(123,[0,37,82,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,37,82,101],[1.0,1.0,1.0,1.0]) |[7.094347002635046,1.7433876811594202,1.1622653162055336] |[0.7094347002635046,0.174338768115942,0.11622653162055337]   |0.0       |Iris-setosa    |
|Iris-setosa|(123,[0,39,58,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,39,58,101],[1.0,1.0,1.0,1.0]) |[7.867074275362319,1.2433876811594202,0.8895380434782609] |[0.7867074275362319,0.12433876811594202,0.0889538043478261]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[0,39,62,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,39,62,107],[1.0,1.0,1.0,1.0]) |[5.159492704509035,2.794443583750028,2.046063711740936]   |[0.5159492704509036,0.2794443583750028,0.2046063711740936]   |0.0       |Iris-setosa    |
|Iris-setosa|(123,[2,39,58,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,39,58,101],[1.0,1.0,1.0,1.0]) |[7.822379507920459,1.2164981462756994,0.9611223458038423] |[0.7822379507920459,0.12164981462756994,0.09611223458038423] |0.0       |Iris-setosa    |
|Iris-setosa|(123,[2,43,62,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,43,62,101],[1.0,1.0,1.0,1.0]) |[7.049652235193186,1.7164981462756992,1.233849618531115]  |[0.7049652235193186,0.17164981462756992,0.1233849618531115]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[2,48,58,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,48,58,107],[1.0,1.0,1.0,1.0]) |[4.375677221716952,3.456351863073957,2.167970915209091]   |[0.4375677221716952,0.3456351863073957,0.21679709152090912]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[4,43,71,108],[1.0,1.0,1.0,1.0]) |0.0         |(123,[4,43,71,108],[1.0,1.0,1.0,1.0]) |[3.521332027976265,3.4847888973511556,2.9938790746725785] |[0.3521332027976265,0.34847888973511554,0.29938790746725785] |0.0       |Iris-setosa    |
|Iris-setosa|(123,[4,54,58,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[4,54,58,107],[1.0,1.0,1.0,1.0]) |[3.6413022217169515,3.925101863073957,2.433595915209091]  |[0.36413022217169516,0.3925101863073957,0.2433595915209091]  |1.0       |Iris-versicolor|
|Iris-setosa|(123,[10,38,58,110],[1.0,1.0,1.0,1.0])|0.0         |(123,[10,38,58,110],[1.0,1.0,1.0,1.0])|[4.537038655825478,3.3499080646243447,2.1130532795501766] |[0.4537038655825478,0.33499080646243445,0.21130532795501766] |0.0       |Iris-setosa    |
|Iris-setosa|(123,[12,48,58,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[12,48,58,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[12,52,71,107],[1.0,1.0,1.0,1.0])|0.0         |(123,[12,52,71,107],[1.0,1.0,1.0,1.0])|[2.657695664339902,4.252970715532974,3.0893336201271238]  |[0.2657695664339902,0.4252970715532974,0.3089333620127124]   |1.0       |Iris-versicolor|
|Iris-setosa|(123,[13,38,62,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[13,38,62,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[15,39,59,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[15,39,59,101],[1.0,1.0,1.0,1.0])|[9.434782608695652,0.391304347826087,0.17391304347826086] |[0.9434782608695651,0.03913043478260869,0.017391304347826084]|0.0       |Iris-setosa    |
|Iris-setosa|(123,[17,37,59,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[17,37,59,101],[1.0,1.0,1.0,1.0])|[8.662055335968379,0.8913043478260869,0.44664031620553357]|[0.866205533596838,0.08913043478260871,0.044664031620553366] |0.0       |Iris-setosa    |
|Iris-setosa|(123,[17,39,59,108],[1.0,1.0,1.0,1.0])|0.0         |(123,[17,39,59,108],[1.0,1.0,1.0,1.0])|[6.818110128751459,1.6741784322348767,1.5077114390136637] |[0.681811012875146,0.1674178432234877,0.1507711439013664]    |0.0       |Iris-setosa    |
|Iris-setosa|(123,[20,35,63,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[20,35,63,101],[1.0,1.0,1.0,1.0])|[8.393939393939394,0.8833333333333333,0.7227272727272727] |[0.8393939393939395,0.08833333333333333,0.07227272727272727] |0.0       |Iris-setosa    |
|Iris-setosa|(123,[25,37,62,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[25,37,62,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[27,47,63,108],[1.0,1.0,1.0,1.0])|0.0         |(123,[27,47,63,108],[1.0,1.0,1.0,1.0])|[4.916336653323167,2.646676038886771,2.4369873077900626]  |[0.4916336653323167,0.2646676038886771,0.24369873077900625]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[29,48,58,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[29,48,58,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |
|Iris-setosa|(123,[33,35,96,110],[1.0,1.0,1.0,1.0])|0.0         |(123,[33,35,96,110],[1.0,1.0,1.0,1.0])|[2.6034320984484296,4.194443583750028,3.2021243178015424] |[0.26034320984484294,0.41944435837500277,0.32021243178015424]|1.0       |Iris-versicolor|
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+
only showing top 20 rows

please execute below query to fetch probability vector ** PS : there are other programmatic method to fetch the probability vector, but this is the way i think without using udfs **

predictions.selectExpr("*", "get_json_object(to_json(struct(probability)), '$.probability.values[0]') as fetch_probability").show(false)

predictions.selectExpr("*", "get_json_object(to_json(struct(probability)), '$.probability.values[0]') as fetch_probability").show(false)
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+-------------------+
|label      |features                              |indexedLabel|indexedFeatures                       |rawPrediction                                             |probability                                                  |prediction|predictedLabel |fetch_probability  |
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+-------------------+
|Iris-setosa|(123,[0,37,82,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,37,82,101],[1.0,1.0,1.0,1.0]) |[7.094347002635046,1.7433876811594202,1.1622653162055336] |[0.7094347002635046,0.174338768115942,0.11622653162055337]   |0.0       |Iris-setosa    |0.7094347002635046 |
|Iris-setosa|(123,[0,39,58,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,39,58,101],[1.0,1.0,1.0,1.0]) |[7.867074275362319,1.2433876811594202,0.8895380434782609] |[0.7867074275362319,0.12433876811594202,0.0889538043478261]  |0.0       |Iris-setosa    |0.7867074275362319 |
|Iris-setosa|(123,[0,39,62,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[0,39,62,107],[1.0,1.0,1.0,1.0]) |[5.159492704509035,2.794443583750028,2.046063711740936]   |[0.5159492704509036,0.2794443583750028,0.2046063711740936]   |0.0       |Iris-setosa    |0.5159492704509036 |
|Iris-setosa|(123,[2,39,58,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,39,58,101],[1.0,1.0,1.0,1.0]) |[7.822379507920459,1.2164981462756994,0.9611223458038423] |[0.7822379507920459,0.12164981462756994,0.09611223458038423] |0.0       |Iris-setosa    |0.7822379507920459 |
|Iris-setosa|(123,[2,43,62,101],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,43,62,101],[1.0,1.0,1.0,1.0]) |[7.049652235193186,1.7164981462756992,1.233849618531115]  |[0.7049652235193186,0.17164981462756992,0.1233849618531115]  |0.0       |Iris-setosa    |0.7049652235193186 |
|Iris-setosa|(123,[2,48,58,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[2,48,58,107],[1.0,1.0,1.0,1.0]) |[4.375677221716952,3.456351863073957,2.167970915209091]   |[0.4375677221716952,0.3456351863073957,0.21679709152090912]  |0.0       |Iris-setosa    |0.4375677221716952 |
|Iris-setosa|(123,[4,43,71,108],[1.0,1.0,1.0,1.0]) |0.0         |(123,[4,43,71,108],[1.0,1.0,1.0,1.0]) |[3.521332027976265,3.4847888973511556,2.9938790746725785] |[0.3521332027976265,0.34847888973511554,0.29938790746725785] |0.0       |Iris-setosa    |0.3521332027976265 |
|Iris-setosa|(123,[4,54,58,107],[1.0,1.0,1.0,1.0]) |0.0         |(123,[4,54,58,107],[1.0,1.0,1.0,1.0]) |[3.6413022217169515,3.925101863073957,2.433595915209091]  |[0.36413022217169516,0.3925101863073957,0.2433595915209091]  |1.0       |Iris-versicolor|0.36413022217169516|
|Iris-setosa|(123,[10,38,58,110],[1.0,1.0,1.0,1.0])|0.0         |(123,[10,38,58,110],[1.0,1.0,1.0,1.0])|[4.537038655825478,3.3499080646243447,2.1130532795501766] |[0.4537038655825478,0.33499080646243445,0.21130532795501766] |0.0       |Iris-setosa    |0.4537038655825478 |
|Iris-setosa|(123,[12,48,58,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[12,48,58,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |0.6315277235193186 |
|Iris-setosa|(123,[12,52,71,107],[1.0,1.0,1.0,1.0])|0.0         |(123,[12,52,71,107],[1.0,1.0,1.0,1.0])|[2.657695664339902,4.252970715532974,3.0893336201271238]  |[0.2657695664339902,0.4252970715532974,0.3089333620127124]   |1.0       |Iris-versicolor|0.2657695664339902 |
|Iris-setosa|(123,[13,38,62,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[13,38,62,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |0.6315277235193186 |
|Iris-setosa|(123,[15,39,59,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[15,39,59,101],[1.0,1.0,1.0,1.0])|[9.434782608695652,0.391304347826087,0.17391304347826086] |[0.9434782608695651,0.03913043478260869,0.017391304347826084]|0.0       |Iris-setosa    |0.9434782608695651 |
|Iris-setosa|(123,[17,37,59,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[17,37,59,101],[1.0,1.0,1.0,1.0])|[8.662055335968379,0.8913043478260869,0.44664031620553357]|[0.866205533596838,0.08913043478260871,0.044664031620553366] |0.0       |Iris-setosa    |0.866205533596838  |
|Iris-setosa|(123,[17,39,59,108],[1.0,1.0,1.0,1.0])|0.0         |(123,[17,39,59,108],[1.0,1.0,1.0,1.0])|[6.818110128751459,1.6741784322348767,1.5077114390136637] |[0.681811012875146,0.1674178432234877,0.1507711439013664]    |0.0       |Iris-setosa    |0.681811012875146  |
|Iris-setosa|(123,[20,35,63,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[20,35,63,101],[1.0,1.0,1.0,1.0])|[8.393939393939394,0.8833333333333333,0.7227272727272727] |[0.8393939393939395,0.08833333333333333,0.07227272727272727] |0.0       |Iris-setosa    |0.8393939393939395 |
|Iris-setosa|(123,[25,37,62,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[25,37,62,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |0.6315277235193186 |
|Iris-setosa|(123,[27,47,63,108],[1.0,1.0,1.0,1.0])|0.0         |(123,[27,47,63,108],[1.0,1.0,1.0,1.0])|[4.916336653323167,2.646676038886771,2.4369873077900626]  |[0.4916336653323167,0.2646676038886771,0.24369873077900625]  |0.0       |Iris-setosa    |0.4916336653323167 |
|Iris-setosa|(123,[29,48,58,101],[1.0,1.0,1.0,1.0])|0.0         |(123,[29,48,58,101],[1.0,1.0,1.0,1.0])|[6.315277235193186,2.185248146275699,1.499474618531115]   |[0.6315277235193186,0.21852481462756992,0.1499474618531115]  |0.0       |Iris-setosa    |0.6315277235193186 |
|Iris-setosa|(123,[33,35,96,110],[1.0,1.0,1.0,1.0])|0.0         |(123,[33,35,96,110],[1.0,1.0,1.0,1.0])|[2.6034320984484296,4.194443583750028,3.2021243178015424] |[0.26034320984484294,0.41944435837500277,0.32021243178015424]|1.0       |Iris-versicolor|0.26034320984484294|
+-----------+--------------------------------------+------------+--------------------------------------+----------------------------------------------------------+-------------------------------------------------------------+----------+---------------+-------------------+
only showing top 20 rows


Please observe the function to access feature vector, change the index as required starting from 0-

accessing first element from probability vector-
get_json_object(to_json(struct(probability)), '$.probability.values[0]')
accessing second element from probability vector
get_json_object(to_json(struct(probability)), '$.probability.values[1]')
Sign up to request clarification or add additional context in comments.

2 Comments

for denseVector like this, you can just cast the Vector into StringType and then use from_json to convert string into an ArrayType column.
Yes agreed, there can be multiple alternatives to achieve this. If you like my answer, please upvote
0

For anyone trying to split the probability columns generated after training a PySpark ML model into usable columns, you can split like this:

prob_df1=lr_pred.withColumn("probability",lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn('probabilityre',split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.