pyspark==2.4.0

Here is the code giving the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)

root
|-- Id: string (nullable = true)
|-- topic_vector_fix_dim: array (nullable = true)
| |-- element: double (containsNull = true)

IllegalArgumentException: 'requirement failed: Column topic_vector_fix_dim must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

I am confused: it rejects my array<double> column, yet the message lists array<double> among the accepted input types.
Each entry of topic_vector_fix_dim is a 1-D array of floats.

2 Answers

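The array type of the features column must have containsNull set to False. The error message prints both the expected and the actual type as array<double>, so the nullability mismatch is not visible in it: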

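from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
# Identity UDF whose declared return type has non-nullable elements
new_schema = ArrayType(DoubleType(), containsNull=False)
udf_foo = udf(lambda x: x, new_schema)
LDA = LDA.withColumn("topic_vector_fix_dim", udf_foo("topic_vector_fix_dim"))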

After that everything works.

The containsNull answer didn't work for me, but this did:

from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
df = vectorAssembler.transform(df)
df = df.select(['features', 'Y'])

2 Comments

In the question the input feature is a single column already. The question is not about converting multiple columns to a single one.
@ArturSokolovsky But it did actually solve the problem. The only explanation that makes sense to me is that the library does not state it explicitly, but internally it only recognizes the vectors produced by VectorAssembler when training Estimators.
