I need to parallelize a data frame in SparkR so that the data is distributed to Spark workers.
Code snippet:

parallelRDD <- SparkR:::parallelize(sc, dataFrame)
It prints the following information on the console.
context.R: A data frame is parallelized by columns.
Each row is atomic for my data processing. I have transposed the data frame, which results in thousands of columns, so each column is now the atomic unit instead. But delegating a single column to a Spark worker does not seem like a good strategy, as there is no evident performance gain.
Is it possible to parallelize a collection of rows instead, so that those rows can be processed on the Spark workers?
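For context, this is a hedged sketch of the behavior I am hoping for: since `parallelize()` distributes the elements of a list, one could manually split the data frame into row chunks with base R's `split()` and parallelize the resulting list, so each worker receives a block of rows rather than a column. The variable names and the chunking scheme here are my own illustration, not an established SparkR idiom, and this uses the older `sparkR.init()` API that matches the snippet above.

```r
library(SparkR)
sc <- sparkR.init()  # older SparkR API, matching the snippet above

# Example data frame (hypothetical)
df <- data.frame(x = 1:1000, y = rnorm(1000))

# Assign each row to one of numSlices chunks, then split by chunk id.
numSlices <- 8
chunkId <- rep(seq_len(numSlices), length.out = nrow(df))
rowChunks <- split(df, chunkId)  # a list of small data frames

# parallelize() distributes list elements, so each element (a chunk
# of rows) becomes one unit of work on a Spark worker.
chunkRDD <- SparkR:::parallelize(sc, rowChunks, numSlices)
```

This sidesteps the column-wise behavior reported by context.R, at the cost of serializing each chunk as a whole R data frame.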