
I'm using PySpark with Elasticsearch. I've noticed that when you create an RDD, it doesn't get executed until a collect, count, or some other 'final' operation is run on it.

Is there a way to execute and cache the transformed RDD? I use the transformed RDD's result for other things as well.

2 Comments
  • 1
    All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset. No other way around. Commented Oct 17, 2015 at 13:34
  • 1
    You can perform a count after caching if you want, but I don't see a purpose in doing that. Commented Oct 17, 2015 at 13:35

1 Answer


Like I said in the comment section,

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

There is no other way around it.
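To make this concrete, here is a minimal PySpark sketch (it assumes a local SparkContext and toy data, not your Elasticsearch RDD): the map and filter calls only record the lineage, and nothing actually runs until the collect() action at the end.

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-demo")

    numbers = sc.parallelize(range(10))           # RDD defined, nothing computed yet
    squares = numbers.map(lambda x: x * x)        # transformation: only recorded
    evens = squares.filter(lambda x: x % 2 == 0)  # transformation: only recorded

    # Only now does Spark build and run a job, because collect() is an action.
    print(evens.collect())                        # [0, 4, 16, 36, 64]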

Why is it lazy?

Lazy evaluation in functional programming has several benefits:

  • Performance increases by avoiding needless calculations and by avoiding error conditions when evaluating compound expressions
  • The ability to construct potentially infinite data structures
  • The ability to define control structures as abstractions instead of primitives

Note: Many modern functional programming languages support lazy evaluation (e.g. Haskell, Scala). Even though you are using Python, Spark itself is written in Scala.

Nevertheless, if you want to force computation after each RDD definition, you can cache the RDD and then call a count action on it, but I don't see much purpose in doing that: the RDD will be computed anyway as soon as it is actually needed.
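A rough sketch of that cache-then-count pattern is below. The parallelize call and the map are just stand-ins for however you actually build your transformed RDD (e.g. from the Elasticsearch connector); the point is only the order of cache() and the first action.

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")

    transformed = sc.parallelize(range(1000)).map(lambda x: x * 2)  # still lazy

    transformed.cache()   # mark the RDD for in-memory storage (also lazy)
    transformed.count()   # action: forces evaluation and populates the cache

    # Later actions reuse the cached partitions instead of recomputing the map.
    sample = transformed.take(5)
    total = transformed.count()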
