
I would like to write NumPy arrays with shape (3, 225, 400) into a binary file.

These arrays are generated from a screen buffer, and each screen has a label. My goal is to save each screen together with its label.

numpy.save essentially takes only two arguments: a file object and the array to be saved. The only option, then, seems to be appending the label to the array as follows:

with open(file, 'wb') as f:
    np.save(f, np.append(buffer, [label]))

However, I would prefer not to do this. Another approach might be to save only the array and then write the label after it, separated by a tab, as in ordinary binary file writing:

with open(file, 'wb') as f:
    np.save(f, buffer)
    f.write(b"\t" + str(label).encode())  # binary mode, so the label has to be written as bytes

I am not sure whether np.save moves the file pointer past the saved data, so that the next write lands after it.
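
For reference, here is a minimal sketch of the repeated-save pattern I have in mind (the file name and dummy arrays are only illustrative):

import numpy as np

a = np.zeros((3, 225, 400), dtype=np.uint8)
b = np.ones((3, 225, 400), dtype=np.uint8)

# Write two arrays back to back into the same open file handle.
with open("pairs.npy", "wb") as f:
    np.save(f, a)
    np.save(f, b)

# Read them back in the same order with repeated np.load calls.
with open("pairs.npy", "rb") as f:
    a2 = np.load(f)
    b2 = np.load(f)

assert (a2 == a).all() and (b2 == b).all()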

Considering that I will be saving hundreds of thousands of array-label pairs at a high frequency, what would you suggest in terms of efficiency?

  • What's the dtype of buffer? Probably some numeric type. What is the nature of label? Make sure you look at np.append(buffer, [label]) before you save it: check the shape and dtype, as well as some values (see the sketch after these comments). There isn't a way of adding a label attribute to an array, either before or during np.save. It is probably best to use the file name as the 'label', or to keep a separate file that pairs filenames and labels. Or look into HDF5 files (h5py), which can hold multiple arrays along with 'label' attributes. Commented Dec 10, 2020 at 23:35
  • Yes, the array is full of numeric values. The type of label depends on the situation; let's assume it is a boolean for now. Creating separate files might be inefficient, but I will have a look at HDF5 files, thanks! Commented Dec 11, 2020 at 11:58
  • A scalar boolean or boolean array? If array, what shape? Commented Dec 11, 2020 at 15:51
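
As a minimal sketch of the check suggested above (assuming a uint8 screen buffer and a scalar boolean label; the dtypes are illustrative):

import numpy as np

buffer = np.zeros((3, 225, 400), dtype=np.uint8)  # stand-in for a screen buffer
label = True                                      # assuming a scalar boolean label

combined = np.append(buffer, [label])
print(combined.shape)  # (270001,): np.append flattens the buffer and appends the label
print(combined.dtype)  # uint8 here; a string label would promote the whole array to a string dtype

Note that saving this flattened array would also lose the (3, 225, 400) shape, which would have to be restored manually on load.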

2 Answers


One option is to save to a NumPy .npz file: np.savez and np.savez_compressed allow multiple named arrays to be saved to one file. An example is included below.

import numpy as np

# Create fake data.
rng = np.random.RandomState(0)
buffer = rng.normal(size=(3, 225, 400))
label = "this is the label"

# Save. Can use np.savez here instead.
np.savez_compressed("output.npz", buffer=buffer, label=label)

# Load.
npzfile = np.load("output.npz")

np.testing.assert_equal(npzfile["buffer"], buffer)
np.testing.assert_equal(npzfile["label"], label)

Another option is HDF5, via h5py. An HDF5 file is organized like a filesystem (the root is / and datasets can be created with names like /data/buffers/dataset1). One way of organizing the buffers and labels is to create a dataset for each buffer and set a label attribute on it.

import h5py
import numpy as np

# Create fake data.
rng = np.random.RandomState(0)
buffer = rng.normal(size=(3, 225, 400))
label = "this is the label"

this_dataset = "/buffers/0"

# Save to HDF5.
with h5py.File("output.h5", "w") as f:
    f.create_dataset(this_dataset, data=buffer, compression="lzf")
    f[this_dataset].attrs.create("label", label)

# Load.
with h5py.File("output.h5", "r") as f:
    loaded_buffer = f[this_dataset][()]  # read into an in-memory array before the file closes
    loaded_label = f[this_dataset].attrs["label"]

2 Comments

Thank you for your detailed answer jakub. As far as I understand, in your first approach we would need to create a different .npz file for each buffer-label pair. Am I right? If so, this wouldn't be appropriate for me. Can I store multiple buffer-label pairs in the same "output.npz" file? I guess the same situation applies to the HDF5 approach as well.
You would probably need to create multiple npz files. But you do not need to create multiple hdf5 files! That's one of the great benefits of hdf5. You can save different buffers to different datasets. Are you planning to write in parallel? If so, then you probably need different files. If you are writing sequentially, it seems to me like hdf5 would be a good option. Another benefit of hdf5 is that you do not need to load all of the data into memory when reading. You can read individual datasets, or even parts of datasets.
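
To illustrate the sequential, single-file HDF5 workflow described in this comment, here is a minimal sketch (the file name, dataset names, and boolean labels are illustrative assumptions):

import h5py
import numpy as np

rng = np.random.RandomState(0)

# Append many buffer-label pairs to one HDF5 file, one dataset per pair.
with h5py.File("screens.h5", "a") as f:  # "a" creates the file or opens it for appending
    for i in range(5):  # stand-in for the screen-capture loop
        buffer = rng.normal(size=(3, 225, 400))
        label = bool(i % 2)  # assuming a scalar boolean label
        name = f"/buffers/{i}"
        f.create_dataset(name, data=buffer, compression="lzf")
        f[name].attrs.create("label", label)

# Later: read back a single pair without loading the whole file.
with h5py.File("screens.h5", "r") as f:
    buf3 = f["/buffers/3"][()]
    lab3 = f["/buffers/3"].attrs["label"]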

If you have a dict like

mydict = { "label0" : array0, "label1" : array1 ... }

just

np.savez( "my.npz", **mydict )
    # == np.savez( "my.npz", label0=array0, label1=array1 ... )

load = np.load( "my.npz" )  # like `mydict`
print( "my.npz labels:" )
print( "\n".join( load.keys() )
array0 = load["label0"]
...

Notes:
Don't compress; do pay attention to the array formats, e.g. np.uint8.
Always add mydict["runinfo"] = "who what when".
For a summary of xx.npz, see the little gist npzinfo.
np.load( ..., mmap_mode=... ) ?
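
For reference, a minimal sketch of the kind of per-array summary such a helper prints (the file name is illustrative):

import numpy as np

# Print name, shape, and dtype for every array stored in an .npz archive.
with np.load("my.npz") as npz:
    for name in npz.files:
        arr = npz[name]
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")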
