
I need to parallelize a data frame in SparkR so that the data is distributed to the Spark workers.

Code snippet:

parallelRDD <- SparkR:::parallelize(sc, dataFrame)

It prints the following information on the console.

context.R: A data frame is parallelized by columns.

Each row is atomic for my data processing. I have transposed the data frame, which results in thousands of columns, so that each column is now the atomic unit. But delegating a single column to a Spark worker does not seem like a good strategy, as there is no evident performance gain.

Is it possible to parallelize a collection of rows, so that those rows can be processed on the Spark workers?

1 Answer

All you need is something like this:

createDataFrame(sqlContext, dataFrame) %>% SparkR:::map(identity) 

Disclaimer: I don't encourage using the internal API. Please be sure to read SPARK-7230 to understand why the RDD API hasn't been included in the first official release of SparkR.
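
Expanded a little, the flow looks like this. This is only a sketch: it assumes a running SparkContext `sc` and SQLContext `sqlContext` (SparkR 1.4-era setup), and it still goes through the internal `:::` API, with all the caveats above.

```r
library(SparkR)
library(magrittr)

# Distribute the local data frame across the workers as a Spark DataFrame;
# each distributed element is now a row, not a column.
df <- createDataFrame(sqlContext, dataFrame)

# Internal API: apply a function to every row. `identity` is a placeholder;
# substitute your per-row processing function.
rowsRDD <- df %>% SparkR:::map(function(row) {
  # `row` holds the values of one row; process it here
  row
})

# Bring the processed rows back to the driver (also internal API).
result <- SparkR:::collect(rowsRDD)
```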

1 Comment

Also understand that a private API is generally undocumented and unsupported, and can change or disappear at any time. You are building on sand. Processing one row at a time in R is also not a high-performance pattern; you may want to try mapPartitions with a data frame function, but that is now private too.
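
For illustration, a hedged sketch of the partition-at-a-time idea the comment suggests. It assumes the same `sc`/`sqlContext` setup and that `SparkR:::mapPartitions` is available in your SparkR build (it is internal, so this may break across versions); the reshaping of the partition into a local data frame is an assumption about how you want to process it, not part of any documented API.

```r
df <- createDataFrame(sqlContext, dataFrame)

# Internal API: the function receives a whole partition (a list of rows)
# instead of one row, so vectorized R code can run over a chunk at once,
# avoiding per-row function-call overhead.
partsRDD <- SparkR:::mapPartitions(df, function(part) {
  # Reassemble the partition's rows into a local data.frame
  local <- do.call(rbind.data.frame, part)
  # ... apply a vectorized data frame function to `local` here ...
  list(local)  # return the partition's result(s) as a list
})
```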
