
I am very new to using PySpark. I have a column of SparseVectors in my PySpark dataframe.

rescaledData.select('features').show(5,False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(262144,[43953,62425,66522,148962,174441,249180],[3.9219733362813143,3.9219733362813143,1.213923135179104,3.9219733362813143,3.9219733362813143,0.5720692490067093])|
|(262144,[57925,66522,90939,249180],[3.5165082281731497,1.213923135179104,3.9219733362813143,0.5720692490067093])                                                    |
|(262144,[23366,45531,73408,211290],[2.6692103677859462,3.005682604407159,3.5165082281731497,3.228826155721369])                                                     |
|(262144,[30913,81939,99546,137643,162885,249180],[3.228826155721369,3.9219733362813143,3.005682604407159,3.005682604407159,3.228826155721369,1.1441384980134186])   |
|(262144,[108134,152329,249180],[3.9219733362813143,2.6692103677859462,2.8603462450335466])                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I need to convert the above dataframe into a matrix in which every row of the matrix is the dense form of the SparseVector in the corresponding row of the dataframe.

For example,

+-----------------+
|features         |
+-----------------+
|(7,[1,2],[45,63])|
|(7,[3,5],[85,69])|
|(7,[1,2],[89,56])|
+-----------------+

Must be converted to

[[0, 45, 63, 0, 0, 0, 0],
 [0, 0, 0, 85, 0, 69, 0],
 [0, 89, 56, 0, 0, 0, 0]]

I have read the link below, which shows that there is a toArray() method that does exactly what I want: https://mingchen0919.github.io/learning-apache-spark/pyspark-vectors.html

However, I am having trouble using it.

from pyspark.sql.functions import udf

vector_udf = udf(lambda vector: vector.toArray())
rescaledData.withColumn('features_', vector_udf(rescaledData.features)).first()

I need it to convert every row into an array and then convert the PySpark dataframe into a matrix.


2 Answers

7

toArray() returns a NumPy array. We can convert it to a list and then collect the dataframe.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))

df.show() ## my sample dataframe
+-------------------+
|           features|
+-------------------+
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
+-------------------+

colvalues = df.select(vector_udf('features').alias('features')).collect()

list(map(lambda x: x.features, colvalues))
# [[0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0]]
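If the goal is an actual matrix rather than a list of lists, the collected rows can be stacked with NumPy. A minimal sketch, assuming NumPy is available on the driver and the data fits in memory:

import numpy as np

# Stack the collected lists into a 2-D array: one row per dataframe row
matrix = np.array([row.features for row in colvalues])
print(matrix.shape)  # (3, 4) for the sample dataframe above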

7

Convert to RDD and map:

vectors = df.select("features").rdd.map(lambda row: row.features)

Convert the result to a distributed matrix:

from pyspark.mllib.linalg.distributed import RowMatrix

matrix = RowMatrix(vectors)
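Since a RowMatrix is distributed, its dimensions can be inspected without bringing the data to the driver:

matrix.numRows()  # number of vectors in the dataframe
matrix.numCols()  # size of each vector (262144 for the question's data)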

If you want dense vectors instead (mind the memory requirements!):

vectors = df.select("features").rdd.map(lambda row: row.features.toArray())

