
I've briefly learned how to use RDDs to build ML models, but in the past I've typically built my ML models using DataFrames. I know spark.ml is the DataFrame-based API for Spark machine learning, but I haven't been able to find examples of how to use it.

My question is: can you provide an example of how DataFrames can be used to build a Spark machine learning model?

P.S. Sorry if this question is not appropriate, wasn't sure where to post this.

1 Answer


Here's a quick example I whipped up just now.

import pyspark.ml                as ml
import pyspark.ml.feature        as ft
import pyspark.ml.classification as cl

# `sc` is the SparkContext, available by default in the PySpark shell;
# in a standalone script you can get one via
# SparkSession.builder.getOrCreate().sparkContext
data = sc.parallelize([
     (1, 'two',  3.4, 0)
    ,(2, 'four', 9.1, 1)
    ,(3, 'one',  2.1, 0)
    ,(4, 'five', 2.6, 0)
]).toDF(['id', 'counter', 'continuous', 'result'])

# index the string column, one-hot encode the index, and assemble all
# feature columns into a single vector column
si  = ft.StringIndexer(inputCol='counter', outputCol='counter_idx')
ohe = ft.OneHotEncoder(inputCol=si.getOutputCol(), outputCol='counter_enc')
va  = ft.VectorAssembler(inputCols=['counter_enc', 'continuous'], outputCol='features')

lr  = cl.LogisticRegression(maxIter=5, featuresCol='features', labelCol='result')

# chain all the stages into a single Pipeline, fit it, and score the data
pip = ml.Pipeline(stages=[si, ohe, va, lr])
pip.fit(data).transform(data).select(data.columns + ['probability', 'prediction']).show()

You can also check the notebooks for Denny's and my book: https://github.com/drabastomek/learningPySpark/blob/master/Chapter06/LearningPySpark_Chapter06.ipynb

Hope this helps.

