1

I have an input dataframe input_df as:

+---------------+--------------------+
|Main_CustomerID|              Vector|
+---------------+--------------------+
|         725153|[3.0,2.0,6.0,0.0,9.0|
|         873008|[4.0,1.0,0.0,1.0,...|
|         625109|[1.0,0.0,6.0,1.0,...|
|         817171|[0.0,4.0,0.0,7.0,...|
|         611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+

The input_df is of schema type,

root
 |-- Main_CustomerID: integer (nullable = true)
 |-- Vector: vector (nullable = true)

By referring to Calculate Cosine Similarity Spark Dataframe, I have created the indexed row matrix and then I do:

val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix 

to find cosine similarity between columns. Now I have a resultant mllib matrix,

cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0  0.4199605255658081  0.5744269579035528  0.22075539284417395  0.561434614044346
0.0  0.0                 0.2791452631195413  0.7259079527665503   0.6206918387272496
0.0  0.0                 0.0                 0.31792539222893695  0.6997167152675132
0.0  0.0                 0.0                 0.0                  0.6776404124278828
0.0  0.0                 0.0                 0.0                  0.0

Now, I need to convert this lm which is of type org.apache.spark.mllib.linalg.Matrix into a dataframe. I expect my output dataframe to look as follows:

+---+------------------+------------------+-------------------+------------------+
| _1|                _2|                _3|                 _4|                _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0|               0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0|               0.0|               0.0|0.31792539222893695|0.6997167152675132|
|0.0|               0.0|               0.0|                0.0|0.6776404124278828|
|0.0|               0.0|               0.0|                0.0|               0.0|
+---+------------------+------------------+-------------------+------------------+

How can I do this in Scala?

0

1 Answer 1

2

To convert the Matrix to a dataframe as specified, do the following. It first converts the matrix to a dataframe containing a single column with an array. Then foldLeft is used to break the array into separate columns.

import spark.implicits._
val cols = (0 until lm.numCols).toSeq

val df = lm.transpose
  .colIter.toSeq
  .map(_.toArray)
  .toDF("arr")

val df2 = cols.foldLeft(df)((df, i) => df.withColumn("_" + (i+1), $"arr"(i)))
  .drop("arr")
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.