
I was trying to use the pyspark.ml.evaluation BinaryClassificationEvaluator as below:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
print evaluator.evaluate(predictions)

My Predictions data frame looks like this:

predictions.select('rating', 'prediction').show()
+------+------------+
|rating|  prediction|
+------+------------+
|     1|  0.14829934|
|     1|-0.017862909|
|     1|   0.4951505|
|     1|0.0074382657|
|     1|-0.002562912|
|     1|   0.0208337|
|     1| 0.049362548|
|     1|  0.09693333|
|     1|  0.17998546|
|     1| 0.019649783|
|     1| 0.031353004|
|     1|  0.03657037|
|     1|  0.23280995|
|     1| 0.033190556|
|     1|  0.35569906|
|     1| 0.030974165|
|     1|   0.1422375|
|     1|  0.19786166|
|     1|  0.07740938|
|     1|  0.33970386|
+------+------------+
only showing top 20 rows

The datatype of each column is as follows:

predictions.printSchema()
root
 |-- rating: integer (nullable = true)
 |-- prediction: float (nullable = true)

Now I get an error with the above ML code saying that the prediction column is a Float while a VectorUDT was expected:

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     51                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     52             if s.startswith('java.lang.IllegalArgumentException: '):
---> 53                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     54             raise
     55     return deco

IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.'

So I thought of converting the predictions column from float to VectorUDT as below:

Applying the schema to the dataframe to convert the float column type to VectorUDT

from pyspark.sql.types import IntegerType, StructType,StructField

schema = StructType([
    StructField("rating", IntegerType, True),
    StructField("prediction", VectorUDT(), True)
])


predictions_dtype=sqlContext.createDataFrame(prediction,schema)

But now I get this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-30-8fce6c4bbeb4> in <module>()
      4 
      5 schema = StructType([
----> 6     StructField("rating", IntegerType, True),
      7     StructField("prediction", VectorUDT(), True)
      8 ])

/Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name, dataType, nullable, metadata)
    401         False
    402         """
--> 403         assert isinstance(dataType, DataType), "dataType should be DataType"
    404         if not isinstance(name, str):
    405             name = name.encode('utf-8')

AssertionError: dataType should be DataType
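
(Looking at the traceback, I think the assertion fires because IntegerType is passed as a class instead of an instance, so presumably the schema would need the parentheses, something like the sketch below; even then I don't see how applying a schema alone would turn a float column into a vector.)

from pyspark.sql.types import IntegerType, StructType, StructField
from pyspark.mllib.linalg import VectorUDT  # Spark 1.x location; pyspark.ml.linalg in Spark 2+

schema = StructType([
    StructField("rating", IntegerType(), True),   # note the (): an instance, not the class
    StructField("prediction", VectorUDT(), True)
])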

It takes so much time to run an ML algorithm with the Spark libraries because of all these weird errors. I even tried MLlib with RDD data, but that gives a ValueError: Null pointer exception.

Please advise.


1 Answer


Try:

from pyspark.sql.functions import udf
from pyspark.mllib.linalg import DenseVector, VectorUDT  # pyspark.ml.linalg in Spark 2+

as_prob = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())

df = df.withColumn("prediction", as_prob(df["prediction"]))

Source: Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator
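
Putting it together, a minimal end-to-end sketch (assuming the Spark 1.x pyspark.mllib.linalg that matches your error message, and that rating is the label column; adjust the names to your data):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.mllib.linalg import DenseVector, VectorUDT
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Wrap the scalar score into a two-element vector: [score for class 0, score for class 1]
as_prob = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())

predictions = (predictions
    .withColumn("prediction", as_prob(col("prediction")))
    .withColumn("rating", col("rating").cast(DoubleType())))  # the evaluator checks for a double label

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="rating")
print(evaluator.evaluate(predictions))  # areaUnderROC by default; pass metricName="areaUnderPR" for the PR curve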


1 Comment

Cool, this worked. Quick question though: pyspark.ml.BinaryClassificationEvaluator should take the original label and the predicted value as input. The original label should be an int since it is 1 or 0. My original rating was an int, but then it gave an error that it needs the label column to be a double. I am not sure why that was required. Does it convert the label to float values too and then compare?
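
As far as I can tell the evaluator checks the label column's data type rather than its values, and in this version it seems to require DoubleType, so casting the integer ratings is enough; the 0/1 values themselves are unchanged. A minimal sketch:

from pyspark.sql.types import DoubleType

predictions = predictions.withColumn("rating", predictions["rating"].cast(DoubleType()))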
