
I've briefly learned how to use RDDs to build ML models, but in the past I've typically built my ML models using DataFrames. I know spark.ml is the DataFrame-based API for Spark machine learning, but I haven't been able to find examples of how to use it.

My question is: can you provide an example of how DataFrames can be used to build a Spark machine learning model?

P.S. Sorry if this question is not appropriate, wasn't sure where to post this.

1 Answer


Here's a quick example I whipped up just now.

import pyspark.ml                as ml
import pyspark.ml.feature        as ft
import pyspark.ml.classification as cl

# `sc` is the SparkContext, available by default in the PySpark shell;
# in a standalone script you can get one via
# SparkSession.builder.getOrCreate().sparkContext
data = sc.parallelize([
     (1, 'two',  3.4, 0)
    ,(2, 'four', 9.1, 1)
    ,(3, 'one',  2.1, 0)
    ,(4, 'five', 2.6, 0)
]).toDF(['id', 'counter', 'continuous', 'result'])

# index the string column, one-hot encode the index, and assemble all
# feature columns into a single vector column
si  = ft.StringIndexer(inputCol='counter', outputCol='counter_idx')
ohe = ft.OneHotEncoder(inputCol=si.getOutputCol(), outputCol='counter_enc')
va  = ft.VectorAssembler(inputCols=['counter_enc', 'continuous'], outputCol='features')

lr  = cl.LogisticRegression(maxIter=5, featuresCol='features', labelCol='result')

# chain all the stages into a single Pipeline, fit it, and score the data
pip = ml.Pipeline(stages=[si, ohe, va, lr])
pip.fit(data).transform(data).select(data.columns + ['probability', 'prediction']).show()

You can also check the notebooks for Denny's and my book: https://github.com/drabastomek/learningPySpark/blob/master/Chapter06/LearningPySpark_Chapter06.ipynb

Hope this helps.

