
I am using a vector assembler to transform a dataframe.

var stringAssembler = new VectorAssembler().setInputCols(encodedstringColumns).setOutputCol("stringFeatures")
df = stringAssembler.transform(df)
var stringVectorSize = df.select("stringFeatures").head.size
var stringPca = new PCA().setInputCol("stringFeatures").setOutputCol("pcaStringFeatures").setK(stringVectorSize).fit(df)

Now stringVectorSize will tell PCA how many components to keep. I am trying to get the size of the output sparse vector from the VectorAssembler, but my code gives size = 1, which is wrong. What is the right way to get the size of a sparse vector stored in a DataFrame column?

To put it plainly:

+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|PetalLengthCm|PetalWidthCm|SepalLengthCm|SepalWidthCm| Id|    Species|Species_Encoded|       Id_Encoded|      stringFeatures|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|          1.4|         0.2|          5.1|         3.5|  1|Iris-setosa|  (2,[0],[1.0])| (149,[91],[1.0])|(151,[91,149],[1....|
|          1.4|         0.2|          4.9|         3.0|  2|Iris-setosa|  (2,[0],[1.0])|(149,[119],[1.0])|(151,[119,149],[1...|
|          1.3|         0.2|          4.7|         3.2|  3|Iris-setosa|  (2,[0],[1.0])|(149,[140],[1.0])|(151,[140,149],[1...|

For the above dataframe, I want to extract the size of the stringFeatures sparse vector (which is 151).

1 Answer


If you read the DataFrame documentation, you will notice that the head method returns a Row. Therefore, rather than obtaining your SparseVector's size, you are obtaining the Row's size. To solve this, you have to extract the element stored in the Row:

val row = df.select("stringFeatures").head
val vector = row(0).asInstanceOf[SparseVector]
val size = vector.size
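
As an aside not covered in the answer above: VectorAssembler also records ML attribute metadata on its output column, so the vector size can be read straight from the schema without collecting any rows. This is a sketch using the `df` and "stringFeatures" names from the question:

```scala
import org.apache.spark.ml.attribute.AttributeGroup

// Read the attribute metadata that VectorAssembler attached to the column.
val group = AttributeGroup.fromStructField(df.schema("stringFeatures"))

// Number of attributes in the vector, or -1 if the metadata is missing.
val stringVectorSize = group.size
```

This avoids pulling a Row back to the driver, but it only works when the column actually carries attribute metadata (columns built by hand, e.g. via a UDF, generally will not).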

For instance:

import sqlContext.implicits._
import org.apache.spark.mllib.linalg.SparseVector

val df = sc.parallelize(Array(10,2,3,4)).toDF("n")
val pepe = udf((i: Int) => new SparseVector(i, Array(i-1), Array(i)))
val x = df.select(pepe(df("n")).as("n"))

x.show()

+---------------+
|              n|
+---------------+
|(10,[9],[10.0])|
|  (2,[1],[2.0])|
|  (3,[2],[3.0])|
|  (4,[3],[4.0])|
+---------------+

val y = x.select("n").head

y(0).asInstanceOf[SparseVector].size
res12: Int = 10
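
Applied back to the snippet in the question, the fix would look roughly like this (a sketch, assuming the `df`, column names, and mllib vector type from the question):

```scala
import org.apache.spark.mllib.linalg.SparseVector

// Extract the first Row, then the vector inside it, then its size.
val stringVectorSize = df.select("stringFeatures").head
  .getAs[SparseVector](0)
  .size

// Use that size as k for PCA, as in the question.
val stringPca = new PCA()
  .setInputCol("stringFeatures")
  .setOutputCol("pcaStringFeatures")
  .setK(stringVectorSize)
  .fit(df)
```

`Row.getAs[T](i)` does the cast for you and is slightly tidier than `row(0).asInstanceOf[SparseVector]`; both work.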

2 Comments

I tried var stringVectorSize = df.select("stringFeatures").head(0) followed by stringVectorSize.size, but that returns stringVectorSize: Array[org.apache.spark.sql.Row] = Array() and res644: Int = 0.
Yes, it works. I was typing df.select("stringFeatures").head.asInstanceOf[SparseVector], which was causing an error. Thanks!
