I am working on Databricks and want to use Spark's MLlib package with Python. When I was using Scikit-learn previously, I would have a list of features and another list of labels for those features. I would simply fit these with a decision tree classifier and predict.
Looking at the documentation, I am a bit lost on how to do something similar in PySpark: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
I believe that in order to use MLlib, I need to extract columns from a dataframe to use as features and labels. So I was wondering how to create a new empty dataframe and then append two columns to it: one for the list of features, and the other for the list of labels.
My list of features (e.g. [2, 0, 0, 1]) is called 'ml_list' and my list of labels (e.g. [1] or [0]) is called 'labels'.
Here is my code so far; I am not sure if I am on the right path. My labels are binary and my features are small integers, so I chose IntegerType():
from pyspark.sql.types import StructType, StructField, IntegerType

field = [StructField("ml_list", IntegerType(), True),
         StructField("labels", IntegerType(), True)]
schema = StructType(field)
df_date = sqlContext.createDataFrame(sc.emptyRDD(), schema)
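For context, here is a sketch of what I think I am ultimately trying to produce. The example data, column names, and the zip-based row construction are just my guesses, and I have not tested the Spark part:

```python
# My guess: pair each feature vector with its label, then hand the list of
# (features, label) tuples to createDataFrame with an explicit schema.
ml_list = [[2, 0, 0, 1], [1, 1, 0, 0]]   # hypothetical feature rows
labels = [1, 0]                           # hypothetical labels, one per row

rows = list(zip(ml_list, labels))
print(rows)  # [([2, 0, 0, 1], 1), ([1, 1, 0, 0], 0)]

# Then, on Databricks (untested guess):
# from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType
# schema = StructType([
#     StructField("ml_list", ArrayType(IntegerType()), True),
#     StructField("labels", IntegerType(), True),
# ])
# df = sqlContext.createDataFrame(rows, schema)
```

Is building the rows first and skipping the empty dataframe entirely closer to the intended workflow?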
Any help would be great, as I am quite new to Spark.