
I have a dataframe, df, that looks like this:

+--------+--------------------+
| user_id|        is_following|
+--------+--------------------+
|       1|[2, 3, 4, 5, 6, 7]  |
|       2|[20, 30, 40, 50]    |
+--------+--------------------+

I can confirm this has the schema:

root
 |-- user_id: integer (nullable = true)
 |-- is_following: array (nullable = true)
 |    |-- element: integer (containsNull = true)

I would like to use Spark's ML routines such as LDA to do some machine learning on this, requiring me to convert the is_following column to a linalg.Vector (not a Scala vector). When I try to do this via

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)

I then get the following error:

java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.

If I am interpreting that correctly, I take away from it that I need to convert types here from integer to something else. (Double? String?)

My question is what is the best way to convert this array to something that will properly vectorize for the ML pipeline?

EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:

+--------+------------+
| user_id|is_following|
+--------+------------+
|       1|           2|
|       1|           3|
|       1|           4|
|       1|           5|
|       1|           6|
|       1|           7|
|       2|          20|
|     ...|         ...|
+--------+------------+
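For what it's worth, the two layouts are interchangeable: in Spark you would go from this long format back to the array form with groupBy("user_id").agg(collect_list("is_following")) (note that collect_list makes no ordering guarantee). The reshaping itself is ordinary grouping; a Spark-free sketch with made-up data:

```scala
// Hypothetical long-format rows: (user_id, followed_id) pairs.
val longFormat = Seq((1, 2), (1, 3), (1, 4), (2, 20), (2, 30))

// Group into the (user_id, followed ids) shape -- the same reshaping
// that groupBy + collect_list performs on a DataFrame.
val arrayFormat: Map[Int, Seq[Int]] =
  longFormat.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2) }

// arrayFormat now maps 1 -> Seq(2, 3, 4) and 2 -> Seq(20, 30)
```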
2
  • Did you try to map to Double and start over? Commented Oct 17, 2017 at 20:30
  • I was able to regenerate the initial table with doubles instead of ints. From there, I tried to re-transform the data with VectorAssembler but got a similar error: java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,true) is not supported. I can also convert the is_following column to double from the edited dataframe (i.e. the one with several identical user_id rows), but this is not really what I want, since I need to pass in an array of values rather than one value at a time. Commented Oct 17, 2017 at 21:28

2 Answers


A simple way both to convert the array into a linalg.Vector and to cast the integers to doubles at the same time is to use a UDF.

Using your dataframe:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq((1, Array(2,3,4,5,6,7)), (2, Array(20,30,40,50))))
  .toDF("user_id", "is_following")

val convertToVector = udf((array: Seq[Int]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})

val df2 = df.withColumn("is_following", convertToVector($"is_following"))

spark.implicits._ is imported here to allow the use of $; col() or ' could be used instead.

Printing the df2 dataframe gives the desired result:

+-------+-------------------------+
|user_id|is_following             |
+-------+-------------------------+
|1      |[2.0,3.0,4.0,5.0,6.0,7.0]|
|2      |[20.0,30.0,40.0,50.0]    |
+-------+-------------------------+

schema:

root
 |-- user_id: integer (nullable = false)
 |-- is_following: vector (nullable = true)
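The element-wise Int-to-Double conversion inside the UDF is plain Scala, so it can be sanity-checked without a Spark session; Vectors.dense ultimately just wants the resulting Array[Double]:

```scala
// The same mapping the UDF applies before handing the array to Vectors.dense.
val raw: Seq[Int] = Seq(2, 3, 4, 5, 6, 7)
val asDoubles: Array[Double] = raw.map(_.toDouble).toArray

// asDoubles is Array(2.0, 3.0, 4.0, 5.0, 6.0, 7.0), the payload for a dense vector.
```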

5 Comments

I did exactly this and got the following error: java.lang.IndexOutOfBoundsException: (4,0) not in [-4,4) x [-10,10). Just to be safe, I renamed user_id to label and is_following to features to no avail.
@CJSullivan Hum, strange. Can you confirm your input? In the code above, I create a dataframe that should look exactly like the one in the question. Is there any difference between it and what you are using?
@CJSullivan It could have something to do with the different lengths of the vectors if you are trying to apply some kind of machine learning on it. All feature vectors should be of equal length.
Ah ha. Here is a hint. All features cannot be of equal length here because they represent a social graph and different users might follow different numbers of people.
Although interestingly, I ran the same test DF from above through LDA and there were no errors. The schema I have is identical to yours. However, when I run the full dataframe (which has the identical schema) through LDA, I get that error.

So your initial input might be better suited than your transformed input. Spark's VectorAssembler requires that all of the input columns be Doubles, not arrays of doubles. Since different users can follow different numbers of people, your current structure could be a good fit; you just need to convert is_following into a Double, which you could in fact do with Spark's VectorIndexer https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer or just manually in SQL.

So the tl;dr is: the type error arises because Spark's Vectors only support Doubles (this is likely to change for image data in the not-so-distant future, but that isn't a good fit for your use case anyway), and your alternative structure (the one without the grouping) might actually be better suited.

You might find the collaborative filtering example in the Spark documentation useful on your further adventures: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html . Good luck and have fun with Spark ML :)

edit:

I noticed you said you're looking to do LDA on the inputs so let's also look at how to prepare the data for that format. For LDA input you might want to consider using the CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer)
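To make the suggestion concrete: CountVectorizer builds a vocabulary over arrays of string tokens and emits sparse count vectors, which is exactly the bag-of-"words" input LDA expects. A Spark-free sketch of the idea, treating the followed ids as string tokens (hypothetical data; the real CountVectorizer orders its vocabulary by frequency, alphabetical is used here just for the sketch):

```scala
// Hypothetical "documents": each user's followed ids, string-ified as tokens.
val docs = Seq(Seq("2", "3", "4"), Seq("3", "4", "20"))

// Vocabulary: every distinct token, assigned a fixed index.
val vocab: Seq[String] = docs.flatten.distinct.sorted
val index: Map[String, Int] = vocab.zipWithIndex.toMap

// Each document becomes a count vector over the vocabulary --
// the dense analogue of what CountVectorizer hands to LDA.
val counts: Seq[Array[Int]] = docs.map { doc =>
  val v = Array.fill(vocab.size)(0)
  doc.foreach(tok => v(index(tok)) += 1)
  v
}
```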

2 Comments

Thanks for the tips! I can easily just use SQL to get is_following into Double. But I am confused because I was hoping to take advantage of the potential speed improvements on cluster detection in a graph by using LDA (which via EM should be graph-based). If I convert this to ArrayType(StringType) for CountVectorizer, will I lose any advantage I had with this being graph data as opposed to a vector of text?
OK...after having run this (and it ran through LDA just fine), I find myself questioning my understanding of CountVectorizer. Really LDA should not care what it receives so long as it is a vector of strings (in this case, those strings happen to actually be integers, but it is pretty irrelevant)? So LDA will still process the data as a graph under the hood and I should not be out anything in terms of speed in the long run, right?
