Currently, I have a few 50 GB SAS data files (sas7bdat) and I would like to migrate my previous SAS code to an open-source tool like R or Python. The biggest issue is how to deal with those giant files... I tried exporting one 50 GB file to CSV and then loading it with fread in R, but R crashed during the load. So I am wondering: what is the best way to handle this? Thanks in advance!
-
Do you have sufficient RAM to keep such a big object in memory? If not, you need to look at R packages for out-of-memory data. See this task view: cran.r-project.org/web/views/HighPerformanceComputing.html – Roland, Nov 10, 2015
-
@Roland I only have 16 GB and yes, I am afraid that is not enough to load everything into memory... – TTT, Nov 10, 2015
-
Consider putting the data into a database like SQLite. – Carl, Nov 10, 2015
1 Answer
First some things to take into consideration:
- Uncompressed SAS files are huge. 50GB is probably <10GB as a CSV.
- R is a memory hog. You won't be able to do this with vanilla R.
- Python is actually quite memory-efficient when handling lists (i.e., records), so it might just fit and work, depending on what you want to do.
- Neither provides out-of-the-box parallelism; that needs an R package or a Python module. So by default it will be slow even if the data fits in memory.
- In Python you can largely avoid the everything-in-memory issue by using iterables (generators, iterators) everywhere instead of materializing everything as lists.
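To illustrate the iterables point above, here is a minimal sketch using only the standard library (the file name and column name are made up for the example): the CSV is processed one row at a time, so memory usage stays flat no matter how large the file is.

```python
import csv

def column_sum(csv_path, column):
    """Stream a large CSV and sum one numeric column without
    ever holding more than one row in memory."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # yields one row dict at a time
        # generator expression: no intermediate list is materialized
        return sum(float(row[column]) for row in reader)

# Hypothetical usage on the exported file:
# total = column_sum("big_export.csv", "amount")
```

The same pattern (open, iterate, aggregate) covers many simple SAS data steps; the key is never calling `list()` on the reader.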
But a convenient solution for you would be to make use of Python together with PySpark (or R with SparkR, but the former is more mature at the moment):
- Not everything needs to be in memory at once
- You get parallelism out-of-the-box
- If you are 'one of those people' using proc sql everywhere, you can leverage Spark SQL for easy re-use of your work.
Have a look at the project: https://spark.apache.org
1 Comment
TTT
Thanks so much for the helpful reply! I will look into PySpark. In addition, the converted CSV file is not small either: ~30 GB...