
I would like to write NumPy arrays with shape (3, 225, 400) into a binary file.

These arrays are generated from a screen buffer, and each screen has a label. My goal is to save each screen together with its label.

numpy.save essentially takes only two arguments: a file object and the array to be saved. The only option, then, seems to be appending the label to the array as follows:

with open(file, 'wb') as f:
    np.save(f, np.append(buffer, [label]))

However, I would prefer not to do this. Another approach might be to save only the array and then write the label after it, separated by a tab, as in ordinary binary file writing:

with open(file, 'wb') as f:
    np.save(f, buffer)
    f.write(b"\t" + str(label).encode())  # binary mode, so the label has to be written as bytes

I am not sure whether np.save moves the file pointer past the saved data, so that the next write lands after it.
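
For reference, here is a minimal sketch of the repeated-save pattern I have in mind (the file name and dummy arrays are only illustrative):

import numpy as np

a = np.zeros((3, 225, 400), dtype=np.uint8)
b = np.ones((3, 225, 400), dtype=np.uint8)

# Write two arrays back to back into the same open file handle.
with open("pairs.npy", "wb") as f:
    np.save(f, a)
    np.save(f, b)

# Read them back in the same order with repeated np.load calls.
with open("pairs.npy", "rb") as f:
    a2 = np.load(f)
    b2 = np.load(f)

assert (a2 == a).all() and (b2 == b).all()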

Considering that I will be saving hundreds of thousands of array-label pairs at a high frequency, what would you suggest in terms of efficiency?

  • What's the dtype of buffer? Probably some numeric type. What is the nature of label? Make sure you look at np.append(buffer, [label]) before you save it: check the shape and dtype, as well as some values (see the sketch after these comments). There isn't a way of adding a label attribute to an array, either before or during np.save. It is probably best to use the file name as the 'label', or to keep a separate file that pairs filenames and labels. Or look into HDF5 files (h5py), which can hold multiple arrays along with 'label' attributes. Commented Dec 10, 2020 at 23:35
  • Yes, the array is full of numeric values. The type of label depends on the situation; let's assume it is a boolean for now. Creating separate files might be inefficient, but I will have a look at HDF5 files, thanks! Commented Dec 11, 2020 at 11:58
  • A scalar boolean or boolean array? If array, what shape? Commented Dec 11, 2020 at 15:51
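
As a minimal sketch of the check suggested above (assuming a uint8 screen buffer and a scalar boolean label; the dtypes are illustrative):

import numpy as np

buffer = np.zeros((3, 225, 400), dtype=np.uint8)  # stand-in for a screen buffer
label = True                                      # assuming a scalar boolean label

combined = np.append(buffer, [label])
print(combined.shape)  # (270001,): np.append flattens the buffer and appends the label
print(combined.dtype)  # uint8 here; a string label would promote the whole array to a string dtype

Note that saving this flattened array would also lose the (3, 225, 400) shape, which would have to be restored manually on load.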

2 Answers


One option is to save to a NumPy .npz file: np.savez and np.savez_compressed allow multiple named arrays to be saved to one file. An example is included below.

import numpy as np

# Create fake data.
rng = np.random.RandomState(0)
buffer = rng.normal(size=(3, 225, 400))
label = "this is the label"

# Save. Can use np.savez here instead.
np.savez_compressed("output.npz", buffer=buffer, label=label)

# Load.
npzfile = np.load("output.npz")

np.testing.assert_equal(npzfile["buffer"], buffer)
np.testing.assert_equal(npzfile["label"], label)

Another option is HDF5, via h5py. An HDF5 file is organized like a filesystem (the root is / and datasets can be created with names like /data/buffers/dataset1). One way of organizing the buffers and labels is to create a dataset for each buffer and set a label attribute on it.

import h5py
import numpy as np

# Create fake data.
rng = np.random.RandomState(0)
buffer = rng.normal(size=(3, 225, 400))
label = "this is the label"

this_dataset = "/buffers/0"

# Save to HDF5.
with h5py.File("output.h5", "w") as f:
    f.create_dataset(this_dataset, data=buffer, compression="lzf")
    f[this_dataset].attrs.create("label", label)

# Load.
with h5py.File("output.h5", "r") as f:
    loaded_buffer = f[this_dataset][()]  # read into an in-memory array before the file closes
    loaded_label = f[this_dataset].attrs["label"]

2 Comments

Thank you for your detailed answer jakub. As far as I understand, in your first approach we would need to create a different .npz file for each buffer-label pair. Am I right? If so, this wouldn't be appropriate for me. Can I store multiple buffer-label pairs in the same "output.npz" file? I guess the same situation applies to the HDF5 approach as well.
You would probably need to create multiple npz files. But you do not need to create multiple hdf5 files! That's one of the great benefits of hdf5. You can save different buffers to different datasets. Are you planning to write in parallel? If so, then you probably need different files. If you are writing sequentially, it seems to me like hdf5 would be a good option. Another benefit of hdf5 is that you do not need to load all of the data into memory when reading. You can read individual datasets, or even parts of datasets.
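
To illustrate the sequential, single-file HDF5 workflow described in this comment, here is a minimal sketch (the file name, dataset names, and boolean labels are illustrative assumptions):

import h5py
import numpy as np

rng = np.random.RandomState(0)

# Append many buffer-label pairs to one HDF5 file, one dataset per pair.
with h5py.File("screens.h5", "a") as f:  # "a" creates the file or opens it for appending
    for i in range(5):  # stand-in for the screen-capture loop
        buffer = rng.normal(size=(3, 225, 400))
        label = bool(i % 2)  # assuming a scalar boolean label
        name = f"/buffers/{i}"
        f.create_dataset(name, data=buffer, compression="lzf")
        f[name].attrs.create("label", label)

# Later: read back a single pair without loading the whole file.
with h5py.File("screens.h5", "r") as f:
    buf3 = f["/buffers/3"][()]
    lab3 = f["/buffers/3"].attrs["label"]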

If you have a dict like

mydict = { "label0" : array0, "label1" : array1 ... }

just

np.savez( "my.npz", **mydict )
    # == np.savez( "my.npz", label0=array0, label1=array1 ... )

load = np.load( "my.npz" )  # like `mydict`
print( "my.npz labels:" )
print( "\n".join( load.keys() )
array0 = load["label0"]
...

Notes:
Don't compress; do pay attention to the array formats, e.g. np.uint8.
Always add mydict["runinfo"] = "who what when".
For a summary of xx.npz, see the little gist npzinfo.
np.load( ..., mmap_mode=... ) ?
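
For reference, a minimal sketch of the kind of per-array summary such a helper prints (the file name is illustrative):

import numpy as np

# Print name, shape, and dtype for every array stored in an .npz archive.
with np.load("my.npz") as npz:
    for name in npz.files:
        arr = npz[name]
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")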
