
Does it create the whole DataFrame in memory?

How do I create a large DataFrame (> 1 million rows) and persist it for later queries?

Comment: This is a pretty good candidate for closing. What is your setup? What is the configuration? What is the exact error you get?

1 Answer


To persist it for later queries:

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel

val sc: SparkContext = ...
val hc = new HiveContext(sc)
// Coalesce to fewer partitions, then cache the rows in serialized form.
val df: DataFrame = myCreateDataFrameCode()
  .coalesce(8).persist(StorageLevel.MEMORY_ONLY_SER)
df.show()

This coalesces the DataFrame down to 8 partitions before persisting it in serialized form. I can't say what number of partitions is best, perhaps even 1. Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps what fits in memory and spills the remaining partitions to disk.
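As a minimal sketch of that disk-backed variant, continuing from the snippet above (so hc and the imports are in scope), you can also register the persisted DataFrame so later queries can reference it by name; the table name big_table is just illustrative:

// Persist with disk spill instead of memory-only, then expose it to SQL.
val df2: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
df2.registerTempTable("big_table")              // Spark 1.x API; name is illustrative
hc.sql("SELECT COUNT(*) FROM big_table").show() // a later query against the cached data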

In answer to the first question, yes, I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting an OutOfMemoryError, that's probably the key roadblock. You don't say how you're creating it. Perhaps there's some workaround, like creating and persisting the DataFrame in smaller pieces, persisting each to MEMORY_AND_DISK_SER, and then combining the pieces, as sketched below.
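A minimal sketch of that piecewise idea, again continuing from the snippet above. The chunk count, chunk size, and the makeChunk helper are all hypothetical stand-ins for however your data is actually produced:

// Hypothetical chunked build: each piece covers a disjoint id range.
def makeChunk(i: Int): DataFrame =
  hc.range(i * 100000L, (i + 1) * 100000L)      // SQLContext.range, Spark 1.4+

// Persist each piece with disk spill enabled, then combine them.
val pieces = (0 until 10).map { i =>
  makeChunk(i).persist(StorageLevel.MEMORY_AND_DISK_SER)
}
val full = pieces.reduce((a, b) => a.unionAll(b)) // DataFrame.unionAll in Spark 1.x
full.count()                                      // force materialization piece by piece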

