
I am working on Databricks and want to use Spark's MLlib package from Python. When I used scikit-learn previously, I had a list of feature vectors and a matching list of labels. I would simply fit a decision tree classifier on them and predict.

Looking at the documentation, I am a bit lost on how to do something similar on PySpark: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

I believe that to use MLlib I need a DataFrame with columns for the features and the labels. So I was wondering how to create a new, empty DataFrame and then append two columns to it: one holding the lists of features and the other holding the labels.

My list of features (e.g. [2, 0, 0, 1]) is called 'ml_list', and my list of labels (e.g. [1] or [0]) is called 'labels'.

Here is my code so far; I am not sure if I am on the right path. Both my features and my labels are binary, so I chose IntegerType():

from pyspark.sql.types import StructType, StructField, IntegerType

field = [StructField("ml_list", IntegerType(), True),
         StructField("Labels", IntegerType(), True)]

schema = StructType(field)
df_date = sqlContext.createDataFrame(sc.emptyRDD(), schema)

Any help would be great, as I am quite new to Spark.

2 Answers


Alternatively:

from pyspark.ml.linalg import Vectors

dd = [(labels[i][0], Vectors.dense(features[i])) for i in range(len(labels))]
df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])


If you have:

labels = [[0], [1], [0]]

and

features = [[2, 0, 0, 1], [0, 0, 0, 1], [0, 2, 0, 1]]

you can:

from pyspark.ml.linalg import Vectors

sc.parallelize(zip(labels, features)).map(lambda lp: (float(lp[0][0]), Vectors.dense(lp[1]))).toDF(["label", "features"])
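To connect this back to the scikit-learn workflow in the question (fit a decision tree, then predict), here is a rough sketch rather than a definitive recipe. The helper names `to_rows` and `fit_tree` are made up for illustration, and the code assumes a `SparkSession` is passed in, such as the `spark` object that Databricks notebooks normally provide. The Spark imports sit inside `fit_tree` so the plain-Python pairing step can be tried without a cluster.

```python
def to_rows(labels, features):
    """Pair each one-element label list with its feature list as plain Python values."""
    if len(labels) != len(features):
        raise ValueError("labels and features must have the same length")
    return [(float(lab[0]), [float(x) for x in feat])
            for lab, feat in zip(labels, features)]


def fit_tree(spark, rows):
    """Hypothetical helper: build a (label, features) DataFrame and fit a decision tree."""
    # Imported lazily so that to_rows() can be used without Spark installed.
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.linalg import Vectors

    df = spark.createDataFrame(
        [(label, Vectors.dense(feat)) for label, feat in rows],
        ["label", "features"],
    )
    model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(df)
    # transform() appends prediction columns (including "prediction"),
    # which plays the role of scikit-learn's predict().
    return model.transform(df)


labels = [[0], [1], [0]]
features = [[2, 0, 0, 1], [0, 0, 0, 1], [0, 2, 0, 1]]
rows = to_rows(labels, features)
```

In a notebook where `spark` is already defined, something like `fit_tree(spark, rows).select("label", "prediction").show()` should display the predictions next to the labels. Note that there is no need to create an empty DataFrame first: the rows are assembled in Python and handed to `createDataFrame` in one step.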
