
I have a dataframe, df, that looks like this:

+--------+--------------------+
| user_id|        is_following|
+--------+--------------------+
|       1|[2, 3, 4, 5, 6, 7]  |
|       2|[20, 30, 40, 50]    |
+--------+--------------------+

I can confirm this has the schema:

root
 |-- user_id: integer (nullable = true)
 |-- is_following: array (nullable = true)
 |    |-- element: integer (containsNull = true)

I would like to use Spark's ML routines such as LDA to do some machine learning on this, requiring me to convert the is_following column to a linalg.Vector (not a Scala vector). When I try to do this via

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)

I then get the following error:

java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.

If I am interpreting that correctly, I take away from it that I need to convert types here from integer to something else. (Double? String?)

My question is what is the best way to convert this array to something that will properly vectorize for the ML pipeline?

EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:

+--------+------------+
| user_id|is_following|
+--------+------------+
|       1|           2|
|       1|           3|
|       1|           4|
|       1|           5|
|       1|           6|
|       1|           7|
|       2|          20|
|     ...|         ...|
+--------+------------+
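For what it's worth, the two layouts are interchangeable: in Spark you would go from this long format back to the array form with groupBy("user_id").agg(collect_list("is_following")) (note that collect_list makes no ordering guarantee). The reshaping itself is ordinary grouping; a Spark-free sketch with made-up data:

```scala
// Hypothetical long-format rows: (user_id, followed_id) pairs.
val longFormat = Seq((1, 2), (1, 3), (1, 4), (2, 20), (2, 30))

// Group into the (user_id, followed ids) shape -- the same reshaping
// that groupBy + collect_list performs on a DataFrame.
val arrayFormat: Map[Int, Seq[Int]] =
  longFormat.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2) }

// arrayFormat now maps 1 -> Seq(2, 3, 4) and 2 -> Seq(20, 30)
```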
2
  • Did you try to map to Double and start over? Commented Oct 17, 2017 at 20:30
  • I was able to regenerate the initial table with doubles instead of ints. From there, I tried to re-transform the data with VectorAssembler but got a similar error: java.lang.IllegalArgumentException: Data type ArrayType(DoubleType,true) is not supported. I can also convert the is_following column to double from the edited dataframe (i.e. the one with several identical user_id rows), but this is not really what I want, since I need to pass in an array of values rather than one value at a time. Commented Oct 17, 2017 at 21:28

2 Answers


A simple way both to convert the array into a linalg.Vector and to cast the integers to doubles at the same time is to use a UDF.

Using your dataframe:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq((1, Array(2,3,4,5,6,7)), (2, Array(20,30,40,50))))
  .toDF("user_id", "is_following")

val convertToVector = udf((array: Seq[Int]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})

val df2 = df.withColumn("is_following", convertToVector($"is_following"))

spark.implicits._ is imported here to allow the use of $; col() or ' could be used instead.

Printing the df2 dataframe gives the desired result:

+-------+-------------------------+
|user_id|is_following             |
+-------+-------------------------+
|1      |[2.0,3.0,4.0,5.0,6.0,7.0]|
|2      |[20.0,30.0,40.0,50.0]    |
+-------+-------------------------+

schema:

root
 |-- user_id: integer (nullable = false)
 |-- is_following: vector (nullable = true)
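The element-wise Int-to-Double conversion inside the UDF is plain Scala, so it can be sanity-checked without a Spark session; Vectors.dense ultimately just wants the resulting Array[Double]:

```scala
// The same mapping the UDF applies before handing the array to Vectors.dense.
val raw: Seq[Int] = Seq(2, 3, 4, 5, 6, 7)
val asDoubles: Array[Double] = raw.map(_.toDouble).toArray

// asDoubles is Array(2.0, 3.0, 4.0, 5.0, 6.0, 7.0), the payload for a dense vector.
```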

5 Comments

I did exactly this and got the following error: java.lang.IndexOutOfBoundsException: (4,0) not in [-4,4) x [-10,10). Just to be safe, I renamed user_id to label and is_following to features to no avail.
@CJSullivan Hum, strange. Can you confirm your input? In the code above, I create a dataframe that should look exactly like the one in the question. Is there any difference between it and what you are using?
@CJSullivan It could have something to do with the different lengths of the vectors if you are trying to apply some kind of machine learning on it. All feature vectors should be of equal length.
Ah ha. Here is a hint. All features cannot be of equal length here because they represent a social graph and different users might follow different numbers of people.
Although interestingly, I ran the same test DF from above through LDA and there were no errors. The schema I have is identical to yours. However, when I run the full dataframe (which has the identical schema) through LDA, I get that error.

So your initial input might be better suited than your transformed input. Spark's VectorAssembler requires that all of the input columns be Doubles, not arrays of doubles. Since different users can follow different numbers of people, your current structure could be a good fit; you just need to convert is_following into a Double, which you could in fact do with Spark's VectorIndexer https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer or just manually in SQL.

So the tl;dr is: the type error arises because Spark's Vectors only support Doubles (this is likely to change for image data in the not-so-distant future, but that isn't a good fit for your use case anyway), and your alternative structure (the one without the grouping) might actually be better suited.

You might find the collaborative filtering example in the Spark documentation useful on your further adventures: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html . Good luck and have fun with Spark ML :)

edit:

I noticed you said you're looking to do LDA on the inputs so let's also look at how to prepare the data for that format. For LDA input you might want to consider using the CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer)
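To make the suggestion concrete: CountVectorizer builds a vocabulary over arrays of string tokens and emits sparse count vectors, which is exactly the bag-of-"words" input LDA expects. A Spark-free sketch of the idea, treating the followed ids as string tokens (hypothetical data; the real CountVectorizer orders its vocabulary by frequency, alphabetical is used here just for the sketch):

```scala
// Hypothetical "documents": each user's followed ids, string-ified as tokens.
val docs = Seq(Seq("2", "3", "4"), Seq("3", "4", "20"))

// Vocabulary: every distinct token, assigned a fixed index.
val vocab: Seq[String] = docs.flatten.distinct.sorted
val index: Map[String, Int] = vocab.zipWithIndex.toMap

// Each document becomes a count vector over the vocabulary --
// the dense analogue of what CountVectorizer hands to LDA.
val counts: Seq[Array[Int]] = docs.map { doc =>
  val v = Array.fill(vocab.size)(0)
  doc.foreach(tok => v(index(tok)) += 1)
  v
}
```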

2 Comments

Thanks for the tips! I can easily just use SQL to get is_following into Double. But I am confused because I was hoping to take advantage of the potential speed improvements on cluster detection in a graph by using LDA (which via EM should be graph-based). If I convert this to ArrayType(StringType) for CountVectorizer, will I lose any advantage I had with this being graph data as opposed to a vector of text?
OK...after having run this (and it ran through LDA just fine), I find myself questioning my understanding of CountVectorizer. Really LDA should not care what it receives so long as it is a vector of strings (in this case, those strings happen to actually be integers, but it is pretty irrelevant)? So LDA will still process the data as a graph under the hood and I should not be out anything in terms of speed in the long run, right?
