
Does it create the whole DataFrame in memory?

How do I create a large DataFrame (> 1 million rows) and persist it for later queries?

Comment: This is a pretty good candidate for closing. What is your setup? What is the configuration? What is the exact error you get?

1 Answer


To persist it for later queries:

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel

val sc: SparkContext = ...
val hc = new HiveContext(sc)
// Coalesce to fewer partitions, then cache the rows in serialized form.
val df: DataFrame = myCreateDataFrameCode()
  .coalesce(8).persist(StorageLevel.MEMORY_ONLY_SER)
df.show()

This coalesces the DataFrame down to 8 partitions before persisting it in serialized form. I can't say what number of partitions is best, perhaps even 1. Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which keeps what fits in memory and spills the remaining partitions to disk.
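As a minimal sketch of that disk-backed variant, continuing from the snippet above (so hc and the imports are in scope), you can also register the persisted DataFrame so later queries can reference it by name; the table name big_table is just illustrative:

// Persist with disk spill instead of memory-only, then expose it to SQL.
val df2: DataFrame = myCreateDataFrameCode()
  .coalesce(8)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
df2.registerTempTable("big_table")              // Spark 1.x API; name is illustrative
hc.sql("SELECT COUNT(*) FROM big_table").show() // a later query against the cached data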

In answer to the first question, yes, I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting an OutOfMemoryError, that's probably the key roadblock. You don't say how you're creating it. Perhaps there's some workaround, like creating and persisting the DataFrame in smaller pieces, persisting each to MEMORY_AND_DISK_SER, and then combining the pieces, as sketched below.
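A minimal sketch of that piecewise idea, again continuing from the snippet above. The chunk count, chunk size, and the makeChunk helper are all hypothetical stand-ins for however your data is actually produced:

// Hypothetical chunked build: each piece covers a disjoint id range.
def makeChunk(i: Int): DataFrame =
  hc.range(i * 100000L, (i + 1) * 100000L)      // SQLContext.range, Spark 1.4+

// Persist each piece with disk spill enabled, then combine them.
val pieces = (0 until 10).map { i =>
  makeChunk(i).persist(StorageLevel.MEMORY_AND_DISK_SER)
}
val full = pieces.reduce((a, b) => a.unionAll(b)) // DataFrame.unionAll in Spark 1.x
full.count()                                      // force materialization piece by piece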

