I've always read that Scala is much faster than PySpark for many operations, but I recently read in a blog post that since the release of Spark 2 the performance differences are much smaller.

Is this perhaps due to the introduction of DataFrames? Does that mean that operations on DataFrames take the same time in Scala and in PySpark?

Is there a detailed and recent performance report on the Scala/PySpark differences?

  • I'm not aware of a recent benchmark study of the performance differences between Scala Spark and PySpark. Nevertheless, the DataFrame and Dataset APIs are built on top of the Spark SQL engine, which uses Catalyst to generate an optimized logical and physical query plan. Whether you use the R, Java, Scala, or Python DataFrame/Dataset API, all relational queries go through the same optimizer, so they get the same space and speed efficiency (illustrated in the first sketch after these comments). Commented Oct 27, 2017 at 13:32
  • But the problem isn't there. Apache Spark hasn't gained any RDD performance since the last benchmarks were done, and sometimes you'll need to fall back to RDDs when you want more control. Scala outperforms Python there (see the second sketch below). Commented Oct 27, 2017 at 13:33
  • The point I discussed in my earlier comments is just one part of what can differ. Unfortunately, this question remains off-topic for being too broad, and I'm voting to close it. Commented Oct 27, 2017 at 13:36
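
To make the Catalyst point from the first comment concrete, here is a minimal PySpark sketch (assuming Spark 2.x; the app name, the in-line data, and the column names are made up for illustration). The equivalent Scala, Java, or R DataFrame code would compile to the same optimized plan, which you can inspect with explain():

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

    # Hypothetical in-line data so the example is self-contained.
    df = spark.createDataFrame(
        [("a", 1), ("b", 2), ("a", 3)],
        ["key", "value"],
    )

    # A relational query expressed through the DataFrame API.
    result = (
        df.filter(F.col("value") > 1)
          .groupBy("key")
          .agg(F.sum("value").alias("total"))
    )

    # Catalyst produces the optimized logical and physical plans; the
    # same plans are generated regardless of the driver language.
    result.explain(True)

    spark.stop()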
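
And as a sketch of the RDD caveat from the second comment: with the RDD API, a Python lambda has to be serialized and executed in separate Python worker processes, an overhead the Scala RDD API does not pay, whereas the declarative DataFrame equivalent runs inside the JVM whatever the driver language. The data and names below are again hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()
    sc = spark.sparkContext

    data = [("a", 1), ("b", 2), ("a", 3)]

    # RDD version: the lambda is pickled and run in Python worker
    # processes, so Scala outperforms Python on this path.
    rdd_total = sc.parallelize(data).map(lambda kv: kv[1] * 2).sum()

    # DataFrame version: the same computation expressed declaratively,
    # executed in the JVM via the plan Catalyst generates.
    df_total = (
        spark.createDataFrame(data, ["key", "value"])
             .select(F.sum(F.col("value") * 2).alias("total"))
             .first()["total"]
    )

    print(rdd_total, df_total)  # both print 12
    spark.stop()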
