
Currently, I have a few 50 GB SAS data files (sas7bdat) and I would like to switch my previous SAS code to an open source tool like R or Python. The biggest issue is how to deal with those giant files. I tried exporting one 50 GB file to CSV and then loading it with fread in R, but it crashed during loading. So I am wondering what is the best way to handle this issue? Thanks in advance!

3 Comments

  • Do you have sufficient RAM to keep such a big object in memory? If not, you need to look at R packages for out-of-memory data. See this task view: cran.r-project.org/web/views/HighPerformanceComputing.html Commented Nov 10, 2015 at 17:49
  • @Roland I only have 16 GB and, yes, I'm afraid that's not enough to load everything into memory... Commented Nov 10, 2015 at 17:56
  • Consider putting the data into a database like SQLite Commented Nov 10, 2015 at 17:57
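The SQLite suggestion above can be sketched as follows: stream the exported CSV into a SQLite database in fixed-size chunks, so memory use stays bounded no matter how large the file is. This is only a sketch using the Python standard library; the file paths, table name, and chunk size are illustrative, not from the original post.

```python
# Sketch: stream a large CSV into SQLite chunk by chunk.
# Paths, table name, and chunk size below are illustrative.
import csv
import itertools
import sqlite3

def csv_to_sqlite(csv_path, db_path, table, chunk_size=50_000):
    """Load a CSV into a SQLite table without reading it all into memory."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        while True:
            # islice pulls at most chunk_size rows at a time from the reader.
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', chunk)
            con.commit()
    con.close()
```

Once the data is in SQLite, you can query it with SQL (or via R's DBI/RSQLite, or Python's sqlite3) without ever holding the full table in RAM.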

1 Answer


First some things to take into consideration:

  • Uncompressed SAS files are huge. A 50 GB sas7bdat is often under 10 GB as a CSV.
  • R is a memory hog. You won't be able to do this with vanilla R.
  • Python is actually quite memory-efficient when handling lists (i.e., records); it might just fit and work, depending on what you want to do.
  • Neither provides parallelism out of the box; that requires an R package or a Python module. So by default it will be slow even if the data fits in memory.
  • In Python you can, to some extent, evade the everything-in-memory problem by using iterables (generators) throughout instead of materializing everything into lists.
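The iterables-instead-of-lists point can be illustrated with a small sketch: compute a column mean over a CSV by yielding one value at a time, so memory use is constant regardless of file size. The file path and column name here are hypothetical, not from the question.

```python
# Sketch: streaming aggregation with generators instead of lists.
# The file path and column name are illustrative.
import csv

def column_values(path, column):
    """Yield one float at a time; the whole file is never held in memory."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield float(row[column])

def streaming_mean(values):
    """Consume any iterable of numbers lazily and return its mean."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count if count else float("nan")
```

The same pattern (read a row, update an accumulator, discard the row) covers many proc means/proc summary-style jobs; pandas can also read sas7bdat files in pieces via `pandas.read_sas(..., chunksize=...)` if you prefer DataFrames per chunk.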

But a convenient solution for you would be to use Python together with PySpark (or R with SparkR, though the former is more mature at the moment):

  • Not everything needs to be in memory at once
  • You get parallelism out-of-the-box
  • If you are 'one of those people' who uses proc sql everywhere, you can leverage Spark SQL to reuse your work with little change.

Have a look at the project: https://spark.apache.org


1 Comment

Thanks so much for the resourceful reply! I will look into PySpark. In addition, the converted CSV file is not small: ~30 GB ...
