I have a sparse vector column, obtained through OneHotEncoder, in a Spark DataFrame. The first 10 rows look like this:

+------------------------------------+
|check_indexed_encoded               |
+------------------------------------+
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
+------------------------------------+
only showing top 10 rows

I am trying to access these elements so that I can convert the column back into ordinary one-hot encoded dummy columns and then convert the entire DataFrame to Pandas without issues. Within Spark I tried using .getItem and .element, but these throw the error "Can't extract value: need struct type". Any ideas how to get the values out of the vector? Thanks!

1 Answer

You could use a UDF that converts each vector into an array. This should do it:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert each ML vector into a plain Python list so Spark can store it as array<double>
vector_udf = F.udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))
# Pull the first element of that array out into its own column
df = df.withColumn("check_indexed_encoded_0", vector_udf(df["check_indexed_encoded"]).getItem(0))

To access the second element, use getItem(1), and so on for the remaining positions.
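
To make it clearer what the UDF hands back for each row, here is a plain-Python sketch (no Spark needed) of how the sparse notation (size, [indices], [values]) shown in the question expands into a dense list. The helper name sparse_to_dense is hypothetical and only illustrates the idea; it is not part of the Spark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list of floats."""
    dense = [0.0] * size              # every position defaults to 0.0
    for i, v in zip(indices, values):
        dense[i] = float(v)           # fill in only the positions that are stored
    return dense

# The two distinct rows from the question: (3,[2],[1.0]) and (3,[0],[1.0])
rows = [(3, [2], [1.0]), (3, [0], [1.0])]
dense_rows = [sparse_to_dense(*row) for row in rows]

# Equivalent of getItem(0): the first element of each dense row
first_elements = [dense[0] for dense in dense_rows]
```

Each position of the dense list corresponds to one dummy column, which is why getItem(0), getItem(1), and so on give you the individual dummies. If you are on Spark 3.0 or later, the built-in pyspark.ml.functions.vector_to_array should do the same conversion without a Python UDF.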
