
I am running a very simple Spark (2.4.0 on Databricks) ML script:

from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=100).setFeaturesCol('features')
model = lda.fit(dataset)

But I received the following error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

Why is my array&lt;double&gt; not an array&lt;double&gt;?

Here is the schema:

root
 |-- BagOfWords: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)
  • Can you please post the output of dataset.printSchema()? Commented Apr 11, 2019 at 19:49
  • @cronoik there you go Commented Apr 11, 2019 at 20:59
  • This Stack Overflow answer solved the problem for me: stackoverflow.com/questions/55162989/… Commented Jun 14, 2019 at 19:17

1 Answer


You probably need to convert the column into vector form (Spark ML's VectorUDT) first, e.g. with the tools in pyspark.ml.feature:

from pyspark.ml.feature import VectorAssembler
