
I need to parallelize a data frame in SparkR so that the data is distributed to the Spark workers.

Code snippet:

parallelRDD <- SparkR:::parallelize(sc, dataFrame)

It prints the following information on the console.

context.R: A data frame is parallelized by columns.

Each row is atomic for my data processing. I have transposed the data frame, which results in thousands of columns, so that each column is now the atomic unit. But delegating a single column to a Spark worker does not seem like a good strategy, as there is no evident performance gain.

Is it possible to parallelize a collection of rows, so that those rows can be processed on the Spark workers?

1 Answer

All you need is something like this:

createDataFrame(sqlContext, dataFrame) %>% SparkR:::map(identity) 

Disclaimer: I don't encourage using the internal API. Please be sure to read SPARK-7230 to understand why the RDD API hasn't been included in the first official release of SparkR.
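
Expanded a little, the flow looks like this. This is only a sketch: it assumes a running SparkContext `sc` and SQLContext `sqlContext` (SparkR 1.4-era setup), and it still goes through the internal `:::` API, with all the caveats above.

```r
library(SparkR)
library(magrittr)

# Distribute the local data frame across the workers as a Spark DataFrame;
# each distributed element is now a row, not a column.
df <- createDataFrame(sqlContext, dataFrame)

# Internal API: apply a function to every row. `identity` is a placeholder;
# substitute your per-row processing function.
rowsRDD <- df %>% SparkR:::map(function(row) {
  # `row` holds the values of one row; process it here
  row
})

# Bring the processed rows back to the driver (also internal API).
result <- SparkR:::collect(rowsRDD)
```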

1 Comment

Also understand that a private API is generally undocumented and unsupported, and can change or disappear at any time. You are building on sand. Processing one row at a time in R is also not a high-performance pattern; you may want to try mapPartitions with a data frame function, but that is now private too.
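
For illustration, a hedged sketch of the partition-at-a-time idea the comment suggests. It assumes the same `sc`/`sqlContext` setup and that `SparkR:::mapPartitions` is available in your SparkR build (it is internal, so this may break across versions); the reshaping of the partition into a local data frame is an assumption about how you want to process it, not part of any documented API.

```r
df <- createDataFrame(sqlContext, dataFrame)

# Internal API: the function receives a whole partition (a list of rows)
# instead of one row, so vectorized R code can run over a chunk at once,
# avoiding per-row function-call overhead.
partsRDD <- SparkR:::mapPartitions(df, function(part) {
  # Reassemble the partition's rows into a local data.frame
  local <- do.call(rbind.data.frame, part)
  # ... apply a vectorized data frame function to `local` here ...
  list(local)  # return the partition's result(s) as a list
})
```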
