Assume Employee is a Java class.

I have a JavaRDD<Employee[]> arrayOfEmpList, i.e., each element of the RDD is an array of employees.

Out of this, I want to create a single list of employees, something like

JavaRDD<Employee>

This is what I tried. I created a List<Employee> empList = new ArrayList<Employee>();

then, for each Employee[] in the RDD:

arrayOfEmpList.foreach(new VoidFunction<Employee[]>() {
    public void call(Employee[] arg0) {
        empList.addAll(Arrays.asList(arg0));
        System.out.println(empList.size()); //prints correct values incrementally
    }
});

System.out.println(empList.size()); //gives 0

I am not able to get the correct size outside the foreach loop.

Is there some other way to achieve this?

P.S.: I want all employee records as separate elements of a single RDD. For example, the 1st employee array may contain 10 records, the 2nd 100 records, and the 3rd 200 records. I want a final list of 310 records, which I can then parallelize and perform actions on.

1 Answer

What you need is the flatMap transformation on your RDD. Here I first convert each employee array into a list:

JavaRDD<Employee> employeeRDD = arrayOfEmpList.flatMap(empArray -> Arrays.asList(empArray));

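This form matches the Spark 1.x API, where the function passed to flatMap returns an Iterable. From Spark 2.0 on it must return an Iterator, so append .iterator() to the list. Also check whether the method has an overload that accepts an array directly rather than a collection.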
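To make the flattening concrete without a Spark cluster, here is a minimal plain-Java sketch that uses java.util.stream as a local stand-in for the RDD. It is not Spark code: the Employee class with a single name field, the FlattenSketch class, and the makeEmployees helper are all hypothetical and exist only for this illustration. The point is just that flatMap turns "a collection of arrays" into "one collection of elements", using the 10, 100 and 200 record arrays from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Employee class from the question.
class Employee {
    final String name;

    Employee(String name) {
        this.name = name;
    }
}

class FlattenSketch {
    // Flattens a list of arrays into one list. This is the same shape of
    // operation that flatMap performs on a JavaRDD<Employee[]>, done locally
    // with Java streams instead of Spark.
    static List<Employee> flatten(List<Employee[]> arrays) {
        return arrays.stream()
                .flatMap(Arrays::stream)
                .collect(Collectors.toList());
    }

    // Builds an array of n placeholder employees (illustrative helper).
    static Employee[] makeEmployees(String prefix, int n) {
        Employee[] result = new Employee[n];
        for (int i = 0; i < n; i++) {
            result[i] = new Employee(prefix + i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Employee[]> arrays = Arrays.asList(
                makeEmployees("a", 10),
                makeEmployees("b", 100),
                makeEmployees("c", 200));
        // Size of the flattened list: 10 + 100 + 200 records.
        System.out.println(flatten(arrays).size());
    }
}
```

In Spark the result of flatMap is already a JavaRDD<Employee>, so there should be no need to gather the records into a local list and parallelize them again.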

You can see this in the transformations section of the programming guide: http://spark.apache.org/docs/latest/programming-guide.html#transformations

JavaDocs: http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDDLike.html#flatMap(org.apache.spark.api.java.function.FlatMapFunction)
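As for why the list in the question stays at size 0: Spark serializes the function passed to foreach, together with the variables it captures, and runs it on the executors, so the code appends to copies of empList rather than to the list on the driver. The plain-Java sketch below imitates that with an ordinary serialization round trip. It is only an analogy under that assumption, not Spark's actual mechanism, and ClosureCopySketch and serializedCopy are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;

class ClosureCopySketch {
    // Round-trips a list through Java serialization and returns the copy.
    // This loosely mimics what happens to a captured variable when a
    // function is shipped to another JVM.
    @SuppressWarnings("unchecked")
    static ArrayList<String> serializedCopy(ArrayList<String> original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ArrayList<String>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> driverList = new ArrayList<>();
        // The "executor" works on a deserialized copy, not on driverList.
        ArrayList<String> executorList = serializedCopy(driverList);
        executorList.addAll(Arrays.asList("emp1", "emp2", "emp3"));

        System.out.println(executorList.size()); // size of the copy
        System.out.println(driverList.size());   // the original is untouched
    }
}
```

If the records really are needed as a local list on the driver, calling collect() on the flattened RDD returns them, at the cost of pulling all data into the driver's memory.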
