How to access spark sparse vector element

Question

I have a sparse vector column in a Spark DataFrame, produced by OneHotEncoder. The first 10 rows look like this:

+------------------------------------+
|check_indexed_encoded               |
+------------------------------------+
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
+------------------------------------+
only showing top 10 rows

I am trying to access the individual elements so that I can convert the column back into ordinary one-hot encoded dummy columns and then convert the entire DataFrame to pandas without issues. Within Spark I tried .getItem and .element, but both throw the error "Can't extract value: need struct type". Any ideas on how to get the values out of the vector? Thanks!

Majte · Accepted Answer · 2020-09-11 18:37:28Z


You could use a UDF. This should do it:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert each ML vector (sparse or dense) into a plain list of doubles.
vector_udf = F.udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))
# Pull the first element of the resulting array into its own column.
df = df.withColumn("check_indexed_encoded_0", vector_udf(df["check_indexed_encoded"]).getItem(0))

To access the second element, use getItem(1), and so on.
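To make it clearer what the UDF produces for each row, here is a minimal pure-Python sketch of the same expansion, assuming the (size, indices, values) notation shown in the question. The helper name sparse_to_dense and the dummy_columns dictionary are hypothetical and only for illustration; in Spark the conversion is done by toArray() inside the UDF.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = float(value)
    return dense


# The two distinct rows that appear in the question's sample output.
rows = [(3, [2], [1.0]), (3, [0], [1.0])]
dense_rows = [sparse_to_dense(*row) for row in rows]

# Mirrors calling getItem(0), getItem(1), getItem(2) on the array column:
# one dummy column per vector position.
dummy_columns = {
    f"check_indexed_encoded_{i}": [dense[i] for dense in dense_rows]
    for i in range(3)
}
```

On Spark 3.0 or later, pyspark.ml.functions.vector_to_array may be worth trying as a built-in alternative to the UDF; it also returns an array column that getItem can index.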

edited Sep 11, 2020 at 18:37

answered Sep 11, 2020 at 10:56

Majte
