
I understand that in order to use the ml.clustering KMeans algorithm (actually, any ML algorithm?) with a DataFrame, I need to have my DataFrame in a certain shape: (id, vector[]), or something like that. How do I apply the right transformation to convert a regular table (stored in df) to the desired structure? This is my df:

from pyspark import SparkConf
from pyspark import SparkContext


conf = SparkConf()
sc = SparkContext(conf=conf)
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
#-----------------------------
#creating DF:
l = [('user1', 2,1,4),('user2',3,5,6)]
temp_df = spark.createDataFrame(l)
temp_df.show()

+-----+---+---+---+
|   _1| _2| _3| _4|
+-----+---+---+---+
|user1|  2|  1|  4|
|user2|  3|  5|  6|
+-----+---+---+---+

I want to use:

from pyspark.ml.clustering import KMeans
kmean = KMeans().setK(2).setSeed(1)
model = kmean.fit(temp_df)

and I get: IllegalArgumentException: u'Field "features" does not exist.'

Thanks,

1 Answer


KMeans requires an input column of vector type which, unless configured otherwise, must be named features. You should use VectorAssembler to combine the feature columns into a single vector column.
