I'm trying to prepare a library (written in Java) to run on Apache Spark. Since the library has hundreds of classes and is still in active development, I do not want to make them serializable one by one. Instead I searched for another method and found this one, but it still does not resolve the serialization issue.
Here is the code sample:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
JavaRDD<Integer> distData = sc.parallelize(data);
JavaRDD<Year4D> years = distData.map(y -> func.call(y));
List<Year4D> years1 = years.collect();
where func is a Function that generates a 4-digit year using Year4D:
private static Function<Integer, Year4D> func = new Function<Integer, Year4D>() {
    public Year4D call(Integer arg0) throws Exception {
        return new Year4D(arg0);
    }
};
and Year4D does not implement Serializable:
public class Year4D {
    private int year = 0;

    public Year4D(int year) {
        if (year < 1000) year += (year < 70) ? 2000 : 1900;
        this.year = year;
    }

    public String toString() {
        return "Year4D [year=" + year + "]";
    }
}
This produces an "object not serializable" exception for Year4D:
Job aborted due to stage failure: Task 6.0 in stage 0.0 (TID 6) had a not serializable result...
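As far as I understand, collect() has to ship each Year4D from the executors back to the driver using plain Java serialization, so the same failure can be reproduced outside Spark. This is a minimal standalone sketch (no Spark classes involved; the class names here are just illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class SerializationCheck {

    // Same shape as my Year4D: a plain class that does NOT implement Serializable
    static class Year4D {
        private int year;

        Year4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }
    }

    public static void main(String[] args) throws Exception {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            // Roughly what collect() does with each task result
            out.writeObject(new Year4D(5));
            System.out.println("serialized OK");
        } catch (NotSerializableException e) {
            // Same root cause as the Spark "not serializable result" failure
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```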
By the way, if I replace the collect() action with foreach(func), it works.
So my question is: why does collect() not work?
And if this approach is not good, what is the best practice for handling a Java library that contains tons of complex classes?
PS. @Tzach said that Year4D isn't wrapped correctly, so it is not actually serialized. What would the correct implementation be, then?
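For completeness, the per-class fix I'm trying to avoid repeating across hundreds of classes is just adding implements Serializable. Sketched here with a plain-Java round trip instead of Spark (class and file names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Year4DFix {

    // Year4D with the one-line change: it now implements Serializable
    static class Year4D implements Serializable {
        private static final long serialVersionUID = 1L;
        private int year;

        Year4D(int year) {
            if (year < 1000) year += (year < 70) ? 2000 : 1900;
            this.year = year;
        }

        @Override
        public String toString() {
            return "Year4D [year=" + year + "]";
        }
    }

    public static void main(String[] args) throws Exception {
        // Round-trip through Java serialization, which is what collect() needs
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Year4D(5));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Year4D copy = (Year4D) in.readObject();
            System.out.println(copy); // Year4D [year=2005]
        }
    }
}
```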