
I have dataframe like this:

+------+-----+-------------------+--------------------+
|    Id|Label|          Timestamp|         Signal_list|
+------+-----+-------------------+--------------------+
|A05439|    1|2014-05-20 05:05:21|[-116, -123, -129...|
|A06392|    1|2013-12-27 04:12:33|[260, 314, 370, 4...|
|A08192|    1|2014-06-03 04:06:15|[334, 465, 628, 8...|
|A08219|    3|2013-12-31 03:12:41|[-114, -140, -157...|
|A02894|    2|2013-10-28 06:10:53|[109, 139, 170, 1...|

The Signal_list column holds about 9k elements per row, and I want to convert it into a vector column. I tried the UDF below:

import org.apache.spark.ml.linalg._

val convertUDF = udf((array : Seq[Long]) => {
  Vectors.dense(array.toArray)
})
val afWithVector = afLabel.select("*").withColumn("Signal_list", convertUDF($"Signal_list"))

But it gives error:

<console>:39: error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.ml.linalg.Vector
 cannot be applied to (Array[Long])
         Vectors.dense(array.toArray)

Dataframe schema:

 |-- Id: string (nullable = true)
 |-- Label: integer (nullable = true)
 |-- Timestamp: string (nullable = true)
 |-- Signal_list: array (nullable = true)
 |    |-- element: long (containsNull = true)

I'm new to Scala, so an answer using PySpark would be more helpful.

2 Answers


The UDF is nearly correct. The problem is that a Spark ML vector can only hold doubles; longs are not accepted. In Scala the fix looks like this:

val convertUDF = udf((array : Seq[Long]) => {
  Vectors.dense(array.toArray.map(_.toDouble))
})

In Python I believe it would look like this:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

convert_udf = udf(lambda vs: Vectors.dense([float(i) for i in vs]), VectorUDT())
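The conversion itself is plain Python; a minimal sketch of the full wiring is below, with the Spark-specific steps shown as comments since they need a live SparkSession (the column and variable names are taken from the question):

```python
# Core of the fix: Spark ML vectors store float64, so the long values
# from Signal_list must be cast to Python floats before building the
# dense vector.
def to_doubles(vs):
    return [float(v) for v in vs]

# With a SparkSession available, the UDF would then be applied like this:
#
#   from pyspark.ml.linalg import Vectors, VectorUDT
#   from pyspark.sql.functions import udf
#
#   convert_udf = udf(lambda vs: Vectors.dense(to_doubles(vs)), VectorUDT())
#   afWithVector = afLabel.withColumn("Signal_list", convert_udf("Signal_list"))

print(to_doubles([-116, -123, -129]))
```

The key point is that every element is coerced to `float` before it reaches `Vectors.dense`, which is exactly what the overload error in the question was complaining about.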



Providing a PySpark answer here. If you are using Spark 3.1.0 or later, the problem can be solved with the built-in array_to_vector function (shown here with the question's Signal_list column):

from pyspark.ml.functions import array_to_vector

dataframe = dataframe.withColumn("Signal_list", array_to_vector("Signal_list"))

