
I am running a very simple Spark (2.4.0 on Databricks) ML script:

from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=100).setFeaturesCol('features')
model = lda.fit(dataset)

But I received the following error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

Why is my array&lt;double&gt; not an array&lt;double&gt;?

Here is the schema:

root
 |-- BagOfWords: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)
  • Can you please post the output of dataset.printSchema()? Commented Apr 11, 2019 at 19:49
  • @cronoik there you go Commented Apr 11, 2019 at 20:59
  • This Stack Overflow answer solved the problem for me: stackoverflow.com/questions/55162989/… Commented Jun 14, 2019 at 19:17

1 Answer


You probably need to convert the column into vector form (Spark ML's VectorUDT) first, e.g. with the tools in pyspark.ml.feature:

from pyspark.ml.feature import VectorAssembler
