Currently, I have a few 50 GB SAS data files (sas7bdat) and I would like to migrate my previous SAS code to an open-source tool like R or Python. The biggest issue is how to deal with those giant files... I tried exporting one 50 GB file to CSV and then loading it with fread in R, but R crashed during the load. So I am wondering: what is the best way to handle this? Thanks in advance!
-
Do you have sufficient RAM to keep such a big object in memory? If not, you need to look at R packages for out-of-memory data. See this task view: cran.r-project.org/web/views/HighPerformanceComputing.html – Roland, Nov 10, 2015
-
@Roland I only have 16 GB and yes, I am afraid that is not enough to load everything into memory... – TTT, Nov 10, 2015
-
Consider putting the data into a database like SQLite. – Carl, Nov 10, 2015
1 Answer
First some things to take into consideration:
- Uncompressed SAS files are huge. 50GB is probably <10GB as a CSV.
- R is a memory hog. You won't be able to do this with vanilla R.
- Python is actually quite memory-efficient when handling lists (i.e., records), so it might just fit and work, depending on what you want to do.
- Neither provides out-of-the-box parallelism; that needs an R package or a Python module. So by default it will be slow even if the data fits in memory.
- In Python you can largely avoid the everything-in-memory issue by using iterables (generators, iterators) everywhere instead of materializing everything as lists.
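To illustrate the iterables point above, here is a minimal sketch using only the standard library (the file name and column name are made up for the example): the CSV is processed one row at a time, so memory usage stays flat no matter how large the file is.

```python
import csv

def column_sum(csv_path, column):
    """Stream a large CSV and sum one numeric column without
    ever holding more than one row in memory."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # yields one row dict at a time
        # generator expression: no intermediate list is materialized
        return sum(float(row[column]) for row in reader)

# Hypothetical usage on the exported file:
# total = column_sum("big_export.csv", "amount")
```

The same pattern (open, iterate, aggregate) covers many simple SAS data steps; the key is never calling `list()` on the reader.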
But a convenient solution for you would be to make use of Python together with PySpark (or R with SparkR, but the former is more mature at the moment):
- Not everything needs to be in memory at once
- You get parallelism out-of-the-box
- If you are 'one of those people' using proc sql everywhere, you can leverage Spark SQL for easy re-use of your work.
Have a look at the project: https://spark.apache.org
1 Comment
TTT
Thanks so much for the helpful reply! I will look into PySpark. In addition, the converted CSV file is not small either: ~30 GB...