I'm working on a project that reads data from two NetCDF files, each of which is 521.8 MB. Admittedly, these are fairly large files. I'm on a MacBook Pro with 4 GB of memory; the machine is about four years old. The code is written in Python.
Each file contains a year's worth of weather data across the Earth, stored as a 4D array with dimensions time (length 1460), altitude (length 17), latitude (length 73), and longitude (length 144). I only need certain portions of that data at a time. Specifically, I need all of the time steps, but only one altitude level, and only a particular region of latitude and longitude (20x44).
I had code that read all of this data from both files, extracted only the data I needed, performed calculations, and wrote the results to a text file. Once done with one year, it looped through 63 years of data, which is 126 files of equivalent size. Now the code fails with an out-of-memory error right at the beginning of the process. The relevant code seems to be:
from mpl_toolkits.basemap.pupynere import NetCDFFile

# Create the file names for the input data.
ufile = "Flow/uwnd." + str(time) + ".nc"
vfile = "Flow/vwnd." + str(time) + ".nc"

# Open the two NetCDF files.
uu = NetCDFFile(ufile)
vv = NetCDFFile(vfile)

# Read the full variables into (4-dimensional) arrays.
uwnd_short = uu.variables['uwnd'][:]
vwnd_short = vv.variables['vwnd'][:]
So, the first section creates the names of the NetCDF files, the second opens them, and the third reads the full variables into 4D arrays. (These may not technically be arrays because of how Python handles the data, but I think of them as such due to my C++ background; apologies for any lack of proper vocabulary.) Later on, I pull the specific data I need out of the 4D array and perform the necessary calculations. The trouble is that this used to work, but now my computer runs out of memory on the vv = NetCDFFile(vfile) line.
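My loop over the years, in outline, does something like the following. This is a stubbed sketch so it runs standalone: load_year is a stand-in for the real NetCDF read, the returned shape matches the subset I actually need rather than the full variable, and the three-year range and mean calculation are just for illustration (the real code covers 63 years and does different calculations).

```python
import numpy as np

def load_year(path):
    # Stand-in for the real NetCDF read; the actual code opens the file
    # and should end up with only the needed subset: all 1460 time steps,
    # one altitude level, and the 20x44 lat/lon region.
    return np.zeros((1460, 20, 44), dtype=np.float32)

yearly_means = []
for year in range(1948, 1951):  # shortened from the real 63-year range
    uwnd = load_year("Flow/uwnd." + str(year) + ".nc")
    vwnd = load_year("Flow/vwnd." + str(year) + ".nc")
    yearly_means.append(float((uwnd + vwnd).mean()))
    del uwnd, vwnd  # release both arrays before the next iteration
print(yearly_means)  # [0.0, 0.0, 0.0]
```

One thing I'm unsure about is whether I need the explicit del (and a close on the file objects) to keep memory from accumulating across the 126 files, or whether rebinding the names each iteration is enough.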
Is there a possible memory leak somewhere? Is there a way to read only the specific range of data I need, so I'm not pulling in the entire file? Is there a more efficient path from reading the data, to extracting the section I need, to performing the calculations?
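For scale, here is the memory arithmetic as I understand it, done with numpy. I'm assuming the variables are stored as int16, which would match both the scale/offset I apply later and the 521.8 MB file size; I've also seen that some interfaces (e.g. the netCDF4 package) let you slice a variable before it is read, along the lines of variables['uwnd'][:, lev, la0:la1, lo0:lo1], so that only the subset is ever loaded.

```python
import numpy as np

# Full variable shape in each file: (time, level, lat, lon)
full_shape = (1460, 17, 73, 144)
# Assuming int16 storage (hence the scale/offset), the raw data per file is:
full_bytes = int(np.prod(full_shape)) * np.dtype(np.int16).itemsize
print(round(full_bytes / 1e6, 1))  # 521.8 -- matches the file size

# The subset I actually need: all times, one level, a 20x44 region
subset_shape = (1460, 20, 44)
subset_bytes = int(np.prod(subset_shape)) * np.dtype(np.int16).itemsize
print(round(subset_bytes / 1e6, 1))  # 2.6 -- roughly 200x smaller
```

So if I could slice before reading, each file would cost a few MB instead of half a GB, which would presumably make the 4 GB machine a non-issue.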
A sample of the data looks like this:

[[[[  4.10000610e+00   4.50001526e+00   4.80000305e+00 ...,   2.90000916e+00   3.30000305e+00   3.70001221e+00]
   [  3.00001526e+00   3.50001526e+00   3.90000916e+00 ...,   1.60000610e+00   2.10000610e+00   2.50001526e+00]
   [ -9.99984741e-01  -6.99996948e-01  -3.99993896e-01 ...,  -1.49998474e+00  -1.39999390e+00  -1.19999695e+00]
   ...,

The numbers continue, of course, and I later apply a scale and offset.