
Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries:

  • spark.mllib, built on top of RDDs.
  • spark.ml, built on top of DataFrames.

According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible.

The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.

If DataFrames are better than RDDs and the guide referred to above recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?

What's the missing point here?

Comments:

  • Also related to Saving models in StackOverflow. Commented Oct 20, 2015 at 13:01
  • This question is in fact interesting and, as a matter of fact, it's related to the one mentioned by Alberto. You can find your answer within @zero323's answer. Commented Oct 20, 2015 at 13:09
  • Late to the party, just wanted to add that though the main Spark ML online documentation doesn't mention it, you can find NaiveBayes within the API doc and you can certainly use it. I am. Commented Aug 23, 2016 at 16:07

1 Answer


Spark 2.0.0

Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are still implemented directly using RDDs.

See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

Spark < 2.0.0

I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames internally. So in practice it is more a matter of having an ml wrapper than anything else. Even native ML implementations (like ml.recommendation.ALS) use RDDs internally.

Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations that are currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.

  • The majority of ML algorithms require an efficient linear algebra library, not tabular processing. Using a cost-based optimizer for linear algebra could be an interesting addition (I believe one already exists) but it looks like for now there is nothing to gain here.
  • The DataFrames API gives you very little control over the data. You cannot use a custom partitioner*, you cannot access multiple records at a time (a whole partition, for example), you're limited to a relatively small set of types and operations, and you cannot use mutable data structures, and so on.
  • Catalyst applies local optimizations. If you pass a SQL query / DSL expression it can analyze it, reorder operations, and apply early projections. All of that is great, but typical scalable algorithms require iterative processing. So what you really want to optimize is the whole workflow, and DataFrames alone are not faster than plain RDDs; depending on the operation they can actually be slower.
  • Iterative processing in Spark, especially with joins, requires fine-grained control over the number of partitions; otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1), which makes iterative processing almost impossible without ugly hacks.
  • Ignoring low-level implementation details, some groups of algorithms, like FPM, don't fit very well into the model defined by ML pipelines.
  • Many optimizations are limited to native types, not UDT extensions like VectorUDT.

There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider this a problem, but one thing is for sure: it doesn't feel natural in the Scala world.

Regarding "better, newer and faster", I would take a look at Deep Dive into Spark SQL's Catalyst Optimizer, in particular the part related to quasiquotes:

The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.


* This has been changed in Spark 1.6 but it is still limited to the default HashPartitioning


4 Comments

Correct me if I'm wrong, but I think even if ml is used solely as a wrapper around mllib, it provides a large benefit for python users because ml will use the Scala version of mllib instead of the python version available through the pyspark.mllib API. This removes a large overhead of moving data between JVM and Python interpreter.
@Max If you start with data loaded into Java objects and never move it back then yes. Otherwise it won't be different. There are also different kinds of "moving data" between VMs. So I am not sure if I understand the question.
Yes, I meant if your input / output is going to be on disk, and you use the DataFrame API for disk I/O. As for movement between VMs, I just realized there's no need to serialize/deserialize data in the pyspark.mllib scenario (since the bulk of the data will forever remain in the Python VM as numpy arrays etc. and won't need to be operated on in the JVM). So I guess the overhead of using Python RDDs vs Scala RDDs is limited to the issues you described here - still it's not completely trivial, I think?
@max Well... There is of course a benefit of keeping things in a single place. But all depends on the context. Typically distributed model training is way more expensive than any serde activity. But you won't optimize by making faster a small fraction of the code. So be reasonable here I guess.
