4

I realize this question has been asked before, but I think my failure is due to a different reason.

            // 'counts' stands in for the JavaPairRDD<String, Integer> produced
            // by the reduce step (the original variable name isn't shown).
            List<Tuple2<String, Integer>> results = counts.collect();
            for (int i = 0; i < results.size(); i++) {
                System.out.println(results.get(i)._1);  // was get(0), which printed the first element every time
            }


    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: tools.MAStreamProcessor$1
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
        at

I have a simple 'map/reduce' program in Spark. The lines above take the results of the reduce step and loop through each resulting element. If I comment them out, I get no errors. I stayed away from 'forEach' and the concise for-each loop, thinking that the code they generate under the hood might produce elements that aren't serializable. I've gotten it down to a plain indexed for loop, so I'm wondering why I'm still running into this error.

Thanks, Ranjit

1 Answer

8

Use the -Dsun.io.serialization.extendedDebugInfo=true JVM flag to turn on serialization debug logging. It will tell you exactly what it is unable to serialize.
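For a spark-submit application, here is a sketch of one way to pass it (the app name is assumed). Note that "Task not serializable" is thrown on the driver, and the driver JVM is already running by the time SparkConf is built, so the driver needs the flag at launch, e.g. via spark-submit's --driver-java-options or spark-defaults.conf; SparkConf can still carry it to the executors:

    SparkConf conf = new SparkConf()
        .setAppName("MAStreamProcessor")  // assumed app name
        // Executors are launched after this point, so the setting reaches them.
        // The driver itself must get the flag when the application is submitted.
        .set("spark.executor.extraJavaOptions",
             "-Dsun.io.serialization.extendedDebugInfo=true");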

The answer will have nothing to do with the lines you pasted. The collect is not the source of the problem; it is just what triggers the computation of the RDD. If you don't compute the RDD, nothing gets sent to the executors, so the accidental inclusion of something unserializable in an earlier step causes no problems until collect forces the job to run.
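For illustration, here is a minimal, self-contained sketch (class and variable names are made up, not taken from the question) of the usual way this happens with the Java API, and the usual fix:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class ClosureDemo {                        // note: NOT Serializable
        private final String prefix = ">> ";

        // BROKEN: the anonymous Function is an inner class, so it keeps a hidden
        // reference to its enclosing ClosureDemo instance. ClosureDemo is not
        // Serializable, so the first action that has to ship this task fails
        // with "Task not serializable", just as in the question.
        public JavaRDD<String> broken(JavaRDD<String> lines) {
            return lines.map(new Function<String, String>() {
                @Override
                public String call(String s) {
                    return prefix + s;                // captures the outer 'this'
                }
            });
        }

        // FIXED: copy what the task needs into a local variable; the lambda then
        // captures only a serializable String, not the enclosing instance.
        public JavaRDD<String> fixed(JavaRDD<String> lines) {
            final String p = prefix;
            return lines.map(s -> p + s);
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("closure-demo").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b"));
                ClosureDemo demo = new ClosureDemo();
                List<String> ok = demo.fixed(lines).collect();   // works
                System.out.println(ok);
                // demo.broken(lines).collect();   // fails: Task not serializable
            }
        }
    }

Building the RDD with the broken function raises no error; the exception only appears when collect forces Spark to serialize the task, which is why the pasted lines look like the culprit.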


2 Comments

Thanks for the tip. I did get more debug information, but it didn't point exactly at the object that wasn't serializable; I was seeing output like '-Object blah, -Field blah, -Object blah', etc. Ultimately it turned out the culprit was a JSONObject instantiated inside the lambda function. When I moved the JSON processing into a static function and called that instead, the serialization error was resolved. Thanks much for your help!
For an overall understanding of Spark serialization, see stackoverflow.com/questions/40818001/…
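Following up on the first comment above, here is a minimal sketch of that fix (the class, helper, and field names are assumed, since the original lambda isn't shown): doing the JSON work in a static helper means the method reference carries no enclosing state into the task.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.json.JSONObject;

    public final class JsonFix {
        // Static helper: it references no enclosing instance, so the method
        // reference below pulls nothing non-serializable into the task closure.
        static String extractName(String json) {
            return new JSONObject(json).getString("name");  // "name" is a made-up field
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("json-fix").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                System.out.println(
                    sc.parallelize(Arrays.asList("{\"name\":\"a\"}", "{\"name\":\"b\"}"))
                      .map(JsonFix::extractName)
                      .collect());
            }
        }
    }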
