I am working on Databricks and want to use Spark's MLlib package with Python. When I was using Scikit-learn previously, I would have a list of features and another list of labels for those features. I would simply fit these with a decision tree classifier and predict.
Looking at the documentation, I am a bit lost on how to do something similar in PySpark: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
I believe that in order to use MLlib, I need to extract columns from a dataframe to use as features and labels. So I was wondering how to create a new empty dataframe and then append two columns to it: one for the list of features, and the other for the list of labels.
My list of features (e.g. [2, 0, 0, 1]) is called 'ml_list' and my list of labels (e.g. [1] or [0]) is called 'labels'.
Here is my code so far; I am not sure if I am on the right path. My labels are binary and my features are small integers, so I chose IntegerType():
from pyspark.sql.types import StructType, StructField, IntegerType

field = [StructField("ml_list", IntegerType(), True),
         StructField("labels", IntegerType(), True)]
schema = StructType(field)
df_date = sqlContext.createDataFrame(sc.emptyRDD(), schema)
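For context, here is a sketch of what I think I am ultimately trying to produce. The example data, column names, and the zip-based row construction are just my guesses, and I have not tested the Spark part:

```python
# My guess: pair each feature vector with its label, then hand the list of
# (features, label) tuples to createDataFrame with an explicit schema.
ml_list = [[2, 0, 0, 1], [1, 1, 0, 0]]   # hypothetical feature rows
labels = [1, 0]                           # hypothetical labels, one per row

rows = list(zip(ml_list, labels))
print(rows)  # [([2, 0, 0, 1], 1), ([1, 1, 0, 0], 0)]

# Then, on Databricks (untested guess):
# from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType
# schema = StructType([
#     StructField("ml_list", ArrayType(IntegerType()), True),
#     StructField("labels", IntegerType(), True),
# ])
# df = sqlContext.createDataFrame(rows, schema)
```

Is building the rows first and skipping the empty dataframe entirely closer to the intended workflow?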
Any help would be great, as I am quite new to Spark.