11

I have a DataFrame in Apache Spark with an array of integers, the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from a RDD?

> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
Traceback (most recent call last):

  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)

  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
values = self._convert_to_array(values, np.float64)

  File     "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)

  File "/usr/local/python/conda/lib/python2.7/site-        packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)

TypeError: float() argument must be a string or a number

I'm getting the same error from every possible arrangement I can think of:

imagerdd = traindf.map(lambda row: Vectors.dense(row.image))
imagerdd = traindf.map(lambda row: row.image)
imagerdd = traindf.map(lambda row: np.array(row.image))

If I try

> imagedf = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)

Traceback (most recent call last):

  File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)

  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)

  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)

  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: setting an array element with a sequence.
0

1 Answer 1

8

Since you didn't provide an example input I'll assume it looks more or less like this where id is a row number and image contains values.

traindf = sqlContext.createDataFrame([
    (1, [1, 2, 3]),
    (2, [4, 5, 6]),
    (3, (7, 8, 9))
], ("id", "image"))

First thing you have to understand is that the DenseMatrix is a local data structure. To be precise it is a wrapper around numpy.ndarray. As for now (Spark 1.4.1) there are no distributed equivalents in PySpark MLlib.

Dense Matrix take three mandatory arguments numRows, numCols, values where values is a local data structure. In your case you have to collect first:

values = (traindf.
    rdd.
    map(lambda r: (r.id, r.image)). # Extract row id and data
    sortByKey(). # Sort by row id
    flatMap(lambda (id, image): image).
    collect())


ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()

dm = DenseMatrix(nrow, ncol, values)

Finally:

> print dm.toArray()
[[ 1.  4.  7.]
 [ 2.  5.  8.]
 [ 3.  6.  9.]]

Edit:

In Spark 1.5+ you can use mllib.linalg.distributed as follows:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(traindf.map(lambda row: IndexedRow(*row)))
mat.numRows()
## 4
mat.numCols()
## 3

although as for now API is still to limited to be useful in practice.

Sign up to request clarification or add additional context in comments.

1 Comment

Do you know how to do the same into scala? stackoverflow.com/questions/47010126/…

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.