
When I run my spark python code as below:

import pyspark

conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf=conf)  # start the SparkContext with this conf
data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x: x.split()).take(all)

The file is about 20 GB and my computer has 8 GB of RAM. When I run the program in standalone mode, it raises an OutOfMemoryError:

Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)

Is Spark unable to deal with files larger than my RAM? Could you tell me how to fix this?


1 Answer


Spark itself can handle data larger than your RAM. The problem is that you are using take, which forces Spark to fetch all of the data into an array in the driver's memory. For a dataset this large, you should write the results back to files instead, for example with saveAsTextFile.
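
A minimal sketch of that approach, reusing the input path from the question (the output directory name is just an example, not something from the thread):

data = sc.textFile('/Users/tsangbosco/Downloads/transactions')
words = data.flatMap(lambda x: x.split())
# saveAsTextFile streams each partition to disk, so nothing has to fit in driver memory
words.saveAsTextFile('/Users/tsangbosco/Downloads/transactions_split')  # hypothetical output dir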

If you are only interested in looking at some of the data, you can use sample or takeSample.
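
For example (the sample fraction and sample size here are arbitrary illustrations):

subset = words.sample(False, 0.001).collect()   # ~0.1% random sample, without replacement
preview = words.takeSample(False, 100)          # exactly 100 random elements returned to the driver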


3 Comments

If I want to use MLlib to train on all the feature samples in the file, how can I send all the samples for training? Thanks
Actually, MLlib has many algorithms. Which one do you want to use?
For example, I want to train the data with logistic regression. The file has several columns of data and each column is a feature. How can I feed all the samples into logistic regression for training?
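
A rough sketch of what that could look like with the RDD-based MLlib API, assuming a whitespace-separated file where the last column is the label and the remaining columns are features (the column layout and the choice of LogisticRegressionWithSGD are assumptions, not something confirmed in this thread):

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

def parse_line(line):
    # assumed layout: feature columns first, label in the last column
    values = [float(x) for x in line.split()]
    return LabeledPoint(values[-1], values[:-1])

points = sc.textFile('/Users/tsangbosco/Downloads/transactions').map(parse_line)
model = LogisticRegressionWithSGD.train(points)  # training runs on the distributed RDD, not on the driver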
