I understand that in order to use the ml.clustering KMeans algorithm (actually any ml algorithm?) with a DataFrame, the DataFrame needs to be in a certain shape: (id, vector[]), or something like that. How do I apply the right transformation to convert a regular table (stored in df) into the desired structure? This is my df:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf()
sc = SparkContext(conf=conf)

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
#-----------------------------
#creating DF:
l = [('user1', 2, 1, 4), ('user2', 3, 5, 6)]
temp_df = spark.createDataFrame(l)
temp_df.show()
+-----+---+---+---+
| _1| _2| _3| _4|
+-----+---+---+---+
|user1| 2| 1| 4|
|user2| 3| 5| 6|
+-----+---+---+---+
I want to use:
from pyspark.ml.clustering import KMeans
kmean = KMeans().setK(2).setSeed(1)
model = kmean.fit(temp_df)
and I get: IllegalArgumentException: u'Field "features" does not exist.'
Thanks,