
I want to map a big Fortran record (12 GB) on disk to a NumPy array. (Mapping instead of loading, to save memory.)

The data stored in the Fortran record is not contiguous, because it is divided by record markers. The record structure is "marker, data, marker, data, ..., data, marker". The lengths of the data regions and the markers are known.

The length of the data between markers is not a multiple of 4 bytes; otherwise I could map each data region to an array.

The first marker can be skipped by setting the offset in memmap; is it possible to skip the other markers and map the data to an array?
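For reference, skipping the first marker with `offset` looks like this (a minimal sketch; the file name `record.bin`, the 4-byte marker size, and the element count `n` are assumptions for illustration):

```python
import numpy as np

# Build a tiny file shaped like "marker, data, marker" for demonstration.
path = 'record.bin'
n = 5                                          # number of float32 values (assumption)
marker = np.array([n * 4], dtype=np.int32)     # Fortran markers hold the record's byte length
data = np.arange(n, dtype=np.float32)
with open(path, 'wb') as f:
    marker.tofile(f)
    data.tofile(f)
    marker.tofile(f)

# Map only the data region by skipping the leading 4-byte marker.
mapped = np.memmap(path, dtype=np.float32, mode='r', offset=4, shape=(n,))
print(mapped)   # the five float32 values, markers excluded
```

This works for the first marker because `offset` is a single starting point; the interior markers are what make a single mapping of the whole record impossible.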

Apologies for any ambiguity, and thanks for any solution or suggestion.


Edited May 15

These are Fortran unformatted files. The data stored in the record is a (1024^3)*3 float32 array (12 GB).

The record layout of variable-length records that are greater than 2 gigabytes is shown below:

[figure: record layout showing subrecords, each bracketed by begin and end markers]

(For details see here -> the section [Record Types] -> [Variable-Length Records].)

In my case, each subrecord except the last has a length of 2147483639 bytes and is separated from the next by 8 bytes (as you can see in the figure above, an end marker of the previous subrecord plus a begin marker of the following one, 8 bytes in total).

We can see that the first subrecord ends with the first 3 bytes of a certain float value, and the second subrecord begins with the remaining 1 byte, since 2147483639 mod 4 = 3.
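To make the layout concrete, the byte offset of each subrecord's payload can be computed like this (a sketch; the constants follow the subrecord length and 4-byte markers described above, and `payload_offset` is a hypothetical helper name):

```python
SUB_LEN = 2147483639          # payload bytes per full subrecord
MARKER = 4                    # each begin/end marker is 4 bytes

# Payload i starts after the file's first begin marker plus i complete
# (payload + end marker + begin marker) units that precede it.
def payload_offset(i):
    return MARKER + i * (SUB_LEN + 2 * MARKER)

print(payload_offset(0))      # 4
print(payload_offset(1))      # 4 + 2147483639 + 8 = 2147483651
print(SUB_LEN % 4)            # 3 -- payloads are not float32-aligned
```

The last line is the crux of the problem: because 2147483639 is not a multiple of 4, no subrecord after the first starts on a float32 boundary, so each subrecord cannot simply be mapped as its own float32 array.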

  • Can you give us a bit more details about the data structure? Based on what I think you're saying, you have variable-length arrays between your markers? How are they packed (e.g. float, int8, int16, whatever)? Commented May 15, 2013 at 12:01
  • Thanks for the attention, and sorry for the lack of details. More information has been added. I'm trying h5py as suggested by Castro. Commented May 15, 2013 at 13:46
  • Sorry, I forgot to notify you @JoeKington. Commented May 17, 2013 at 1:34

1 Answer


It is possible using numpy.memmap:

import numpy as np

offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size1,))
offset += size1*byte_size
data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size2,))
offset += size2*byte_size
data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size3,))

You need to set byte_size according to the data type. For example:

  • int32 requires byte_size = 32/8 = 4
  • int16 requires byte_size = 16/8 = 2
  • and so forth...
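Rather than hard-coding the division, numpy can report the element size directly via `np.dtype(...).itemsize` (a small sketch):

```python
import numpy as np

# itemsize gives the byte size of one element for any dtype.
for name in ('int16', 'int32', 'float32', 'float64'):
    print(name, np.dtype(name).itemsize)
# int16 2
# int32 4
# float32 4
# float64 8
```

This avoids mistakes when the dtype changes later, since the byte size always tracks the dtype actually used.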

If the data type is constant for the entire array, you can map the data as a 2D array like:

shape = (total_length//size, size)   # integer division
data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)

You can modify the memmap object as much as you want. It is even possible to make arrays that share the same elements, in which case changes made to the shared elements are seen by all of the corresponding arrays.
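For example, two memmaps opened over the same file region see each other's writes (a sketch using a hypothetical file `shared.bin`):

```python
import numpy as np

path = 'shared.bin'
np.arange(8, dtype=np.int32).tofile(path)   # create an 8-element file

# Two independent writable mappings of the same bytes.
a = np.memmap(path, dtype=np.int32, mode='r+', shape=(8,))
b = np.memmap(path, dtype=np.int32, mode='r+', shape=(8,))

a[0] = 99        # write through the first mapping
a.flush()        # push the change out to the file on disk
print(b[0])      # 99 -- the second mapping observes the same bytes
```

Because both objects map the same underlying pages, the write through `a` is visible through `b` without re-reading the file.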



6 Comments

I can access the 12 GB file with memmap without any error. However, two problems remain. The first is endianness: order='F' sets the 2D (or higher) array storage order, not the byte order, so I have to do an extra endian swap. The second is that the markers are mixed in with the data, and I have no idea how to pick the markers out. Maybe my description of the question is not clear.
Or I can use shape and offset to read the first subrecord of the file, but the question remains -- how can I put several subrecords together? I'm sorry for the poor expression in English.
The endian problem is solved; just prefix the dtype with '>', i.e. test = np.memmap(file_path, dtype='>i', mode='r', order='F')
@substructure if you consider this solution satisfactory already, you can toggle it as accepted
Many thanks for your help. I'm afraid the main question remains -- is it possible to map the data (markers excluded) to an array? Now I can map the whole file to an array, but the data and the markers are still interlaced, which makes indexing into the pure data inconvenient.
