I have a dataframe, df, that looks like this:
+-------+------------------+
|user_id|      is_following|
+-------+------------------+
|      1|[2, 3, 4, 5, 6, 7]|
|      2|  [20, 30, 40, 50]|
+-------+------------------+
I can confirm this has the schema:
root
 |-- user_id: integer (nullable = true)
 |-- is_following: array (nullable = true)
 |    |-- element: integer (containsNull = true)
I would like to use Spark's ML routines, such as LDA, to do some machine learning on this, which requires converting the is_following column to a linalg.Vector (not a Scala Vector). When I try to do this with VectorAssembler:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)
I then get the following error:
java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.
If I am interpreting this correctly, I need to convert the element type from integer to something else. (Double? String?)
My question: what is the best way to convert this array into something that will vectorize properly for the ML pipeline?
EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:
+-------+------------+
|user_id|is_following|
+-------+------------+
|      1|           2|
|      1|           3|
|      1|           4|
|      1|           5|
|      1|           6|
|      1|           7|
|      2|          20|
|    ...|         ...|
+-------+------------+
I also tried VectorAssembler on an array of doubles, but got a similar error:
java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,true) is not supported.
I can also convert the is_following column to double in the edited dataframe (i.e. the one with several rows sharing the same user_id), but this is not really what I want, since I need to pass in an array of values rather than one value at a time.
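For illustration, here is a minimal sketch of the kind of conversion being asked about: a UDF that maps the integer array to an ml.linalg.Vector. The local SparkSession, the sample rows, and the names arrayToVector and withFeatures are assumptions made for this example and are not part of the original setup. It also assumes the arrays contain no null elements.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session, only so the sketch runs on its own.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("array-to-vector-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows mirroring the dataframe shown in the question.
val df = Seq(
  (1, Seq(2, 3, 4, 5, 6, 7)),
  (2, Seq(20, 30, 40, 50))
).toDF("user_id", "is_following")

// Hypothetical helper: converts array<int> to a dense ml.linalg.Vector.
// Assumes no null elements inside the array.
val arrayToVector = udf { (xs: Seq[Int]) =>
  Vectors.dense(xs.map(_.toDouble).toArray)
}

val withFeatures = df.withColumn("features", arrayToVector(col("is_following")))
withFeatures.printSchema()
```

Note that this only changes the column type. Each row's vector has as many entries as that user's array, so rows end up with vectors of different lengths, and the values are raw user IDs rather than counts. Whether that representation is suitable for LDA is a separate question.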