When I try to serialize a model with MLeap using the following code:

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row

# Create a test DataFrame (assumes a PySpark shell/notebook where sc and spark are predefined)
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()

# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')

feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()], outputCol="features")

feature_pipeline = [string_indexer, feature_assembler]

featurePipeline = Pipeline(stages=feature_pipeline)

fittedPipeline = featurePipeline.fit(df2)


# serialize the model:
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fittedPipeline.transform(df2))

However, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-98a49e4cd023> in <module>()
----> 1 fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fittedPipeline.transform(df2))

/opt/anaconda2/envs/py345/lib/python3.4/site-packages/mleap/pyspark/spark_support.py in serializeToBundle(self, path, dataset)
     22 
     23 def serializeToBundle(self, path, dataset=None):
---> 24     serializer = SimpleSparkSerializer()
     25     serializer.serializeToBundle(self, path, dataset=dataset)
     26 

/opt/anaconda2/envs/py345/lib/python3.4/site-packages/mleap/pyspark/spark_support.py in __init__(self)
     37     def __init__(self):
     38         super(SimpleSparkSerializer, self).__init__()
---> 39         self._java_obj = _jvm().ml.combust.mleap.spark.SimpleSparkSerializer()
     40 
     41     def serializeToBundle(self, transformer, path, dataset):

TypeError: 'JavaPackage' object is not callable

Can anyone please assist?

2 Answers

I managed to fix this problem by downloading the missing jar files and pointing to them in my spark-submit script. In my case, I had installed MLeap 0.8.1 and was using Spark 2 built on Scala 2.11, so I downloaded the following jar files from MvnRepository:

  • metrics-core-2.2.0
  • mleap-base_2.11-0.8.1
  • mleap-core_2.11-0.8.1
  • mleap-runtime_2.11-0.8.1
  • mleap-spark_2.11-0.8.1
  • mleap-spark-base_2.11-0.8.1
  • mleap-tensor_2.11-0.8.1

Then I pointed to these jar files using the --jars flag in my spark-submit arguments, as follows (I also pointed to the Maven repository using the --repositories flag):

export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --driver-memory 40g --num-executors 15 --executor-memory 30g --executor-cores 5 --packages ml.combust.mleap:mleap-runtime_2.11:0.8.1 --repositories http://YOUR MAVEN REPO/ --jars arpack_combined_all-0.1.jar,mleap-base_2.11-0.8.1.jar,mleap-core_2.11-0.8.1.jar,mleap-runtime_2.11-0.8.1.jar,mleap-spark_2.11-0.8.1.jar,mleap-spark-base_2.11-0.8.1.jar,mleap-tensor_2.11-0.8.1.jar pyspark-shell'
jupyter notebook --no-browser --ip=$(hostname -f)

(Source)
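
Once these jars are on the classpath, a quick sanity check is to instantiate the JVM class the traceback failed on (a minimal sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If the MLeap jars are on the classpath, this returns a py4j JavaObject;
# if they are missing, it raises the same "TypeError: 'JavaPackage' object is not callable".
serializer = spark.sparkContext._jvm.ml.combust.mleap.spark.SimpleSparkSerializer()
print(serializer)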


The answer from @Tshilidzi Madau is correct: what you need to do is add the mleap-spark jar to your Spark classpath.

One option in PySpark is to set the spark.jars.packages config when creating the SparkSession:

from pyspark.sql import SparkSession

# the "excludes" entry is needed because this lib seems not to be available in public Maven repos
spark = SparkSession.builder \
    .config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.19.0') \
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all") \
    .getOrCreate()

I tested this with Spark 3.0.3 and MLeap 0.19.0.
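
With the session configured this way, the serialization call from the question should then go through (a sketch reusing the fittedPipeline and df2 names from the question):

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # as in the question, this import makes serializeToBundle available

fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                 fittedPipeline.transform(df2))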
