
I understand that in order to use the ml.clustering KMeans algorithm (actually, any ML algorithm?) with a DataFrame, I need to have my DataFrame in a certain shape: (id, vector[]), or something like that. How do I apply the right transformation to convert a regular table (stored in df) to the desired structure? This is my df:

from pyspark import SparkConf
from pyspark import SparkContext


conf = SparkConf()
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
#-----------------------------
#creating DF:
l = [('user1', 2,1,4),('user2',3,5,6)]
temp_df = spark.createDataFrame(l)
temp_df.show()

+-----+---+---+---+
|   _1| _2| _3| _4|
+-----+---+---+---+
|user1|  2|  1|  4|
|user2|  3|  5|  6|
+-----+---+---+---+

I want to use:

from pyspark.ml.clustering import KMeans
kmean = KMeans().setK(2).setSeed(1)
model = kmean.fit(temp_df)

and I get: IllegalArgumentException: u'Field "features" does not exist.'

Thanks,

1 Answer


KMeans requires an input column of vector type which, unless configured otherwise, must be named features. You should use VectorAssembler to combine the feature columns into a single vector column.
