
I have huge np.memmap objects and need to expand them regularly. I was wondering whether my current approach is safe and the most efficient one, and started searching the internet. I stumbled across the following Stack Overflow question:

stackoverflow: resizing numpy memmap arrays

As the above question is 12 years old, I decided to open a new question.

The approach I am currently using is the one mentioned at the very bottom of the above post:

import numpy as np

arr_shape = (1000, 256)
arr = np.memmap(f"{output_dir}/fingerprints.mm", dtype="float32", mode="w+", shape=arr_shape)
# some stuff is written to arr
arr.flush()

new_shape = (2000, 256)
arr = np.memmap(f"{output_dir}/fingerprints.mm", dtype="float32", mode="r+", shape=new_shape)
# some stuff is written to arr[arr_shape[0]:, :]
arr.flush()

In this manner, I increase the size of the memmap to something like (1e8, 256).

My questions:

  • In the comments section of this approach, Michael answers "[...] I would prefer an inplace solution, [...]". Is this not an in-place solution? I figured it would be. Does this method copy the whole memmap into a new one?
  • If it actually is an in-place solution: Is this approach safe? What happens if the following blocks in memory are already in use?
  • If it is not an in-place solution: Are there more efficient ways to do this?
  • There is also a memmap.resize method (memmap.resize docs). However, wherever I read about it, people claim to have various issues with this method. But all these posts are quite old and I am not sure whether these issues still persist.
  • ChatGPT proposes a solution with truncating the file, however, I cannot find any references for this approach:


n_bytes = np.prod(new_shape) * np.dtype("float32").itemsize

with open(path, "r+b") as f:
    f.truncate(n_bytes)

arr = np.memmap(path, dtype="float32", mode="r+", shape=new_shape)
  • "However, wherever I read about it, people claim that they have various issues with this method." There's an issue on the NumPy issue tracker about this, which is still open: github.com/numpy/numpy/issues/4198. If the problems mentioned in the issue were solved, I would expect it to be closed, so I would guess the problems still exist. Commented Oct 28 at 14:52
  • Thanks for the info, so memmap.resize is not the way to go for now :) Commented Oct 28 at 15:43
  • Re "in-place": If you mean whether the mapped data is copied or not, then yes, the resizing is in-place. No copying will take place. However, the arr object itself is not updated in-place. You get a new object representing the same (enlarged) memory. If you keep other references to the old object around, they will continue to see the same memory with the old size. Commented Oct 28 at 16:03
  • I actually meant “in-place” in terms of the memory on disk (i.e. whether the data gets copied). I understand now that what Michael meant with “I would prefer an inplace solution” was referring to the Python object itself, not the underlying file - Thanks for the clarification! Commented Oct 29 at 8:46

2 Answers


Different meanings of "in-place"

Part of the confusion with the other post comes from what exactly is to be changed in-place.

One interpretation is changing the size of the mapped file without copying any of the underlying data. This is what your code accomplishes. No copy takes place. At the end of your code snippet, you will have an arr object that refers to the same data on disk and, insofar as the data is still in the page cache, the same physical pages in memory; not counting the new memory, of course.

This data will be mapped to a different virtual address, as you can see by inspecting arr.data, but it is the same physical memory. If you keep the old arr object around, you will find that both map the same memory and writing to one will be visible to the other. This works across processes. The mapping is said to be coherent. This also works on Windows. However, the mapping may or may not be coherent with regular reads and writes. That's what the flush() method is for.
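This coherence between two mappings of the same file can be checked directly; a minimal sketch (the temporary path and array sizes are illustrative, not from the question):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.mm")

# Initial mapping: 10 float32 values, file created and zero-filled.
old = np.memmap(path, dtype="float32", mode="w+", shape=(10,))

# Second, larger mapping of the same file; numpy extends the file.
new = np.memmap(path, dtype="float32", mode="r+", shape=(20,))

old[0] = 42.0         # a write through the old mapping...
print(float(new[0]))  # ...is visible through the new one: 42.0
```

Both objects map the same shared pages, so writes through either are immediately visible through the other, even though their virtual addresses differ.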

Another interpretation of "in-place" is changing the arr object to represent the larger memory range. In this regard, the snippet is not in-place, while a functioning (see bug) memmap.resize() would be. I believe this is what the comments on the other question refer to. As discussed above, you get a new arr object and have to replace all references to the old arr yourself.

Portability

There are some concerns about different system behaviors and your code. If this were C/C++, simply mapping a larger region would result in undefined (or OS-specific) behavior. For example, the POSIX standard for mmap specifies that larger regions can be mapped but that accessing the part beyond the end of the file results in SIGBUS errors. On Windows, the file would be extended.

You would get this behavior if you used the mmap module and then created a numpy array with np.frombuffer. However, numpy.memmap takes care of extending the file across platforms. I've not inspected the code but traced the system calls. Numpy checks the file size and expands the file to the appropriate minimum. Unless Numpy changes its behavior, your code is perfectly fine.

The way this is accomplished, at least on Linux and my particular numpy version 1.26.4, is to seek to the end of the file and write a single zero byte. We thus create a sparse file. I don't know why they don't use os.ftruncate, probably portability concerns.
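You can observe numpy extending the file from Python alone, without tracing system calls; a small sketch (the temporary path is just for illustration, sizes taken from the question):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "grow.mm")

np.memmap(path, dtype="float32", mode="w+", shape=(1000, 256)).flush()
print(os.path.getsize(path))   # 1_024_000 bytes (1000 * 256 * 4)

# Re-mapping with a larger shape in "r+" mode extends the file itself.
np.memmap(path, dtype="float32", mode="r+", shape=(2000, 256)).flush()
print(os.path.getsize(path))   # 2_048_000 bytes
```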

Performance

Expanding the array like this is not necessarily the fastest way. In general, memory-mapped IO works best for repeated, random access to data that is already in memory (the OS's page cache). Large sequential IO and appending to a file are often faster with normal routines. Additionally, not all file systems support sparse files, in which case the file expansion done by Numpy may actually fill the file on disk with zeros before reading it back through the memory mapping. Exact performance depends on the use case. Consider something like this (and always benchmark!):

arr = np.memmap("fingerprints.mm", …)
arr.flush()
new_data = np.array(…)
with open("fingerprints.mm", "ab") as fout:
    fout.write(new_data)
new_shape = (len(arr) + len(new_data), ) + arr.shape[1:]
arr = np.memmap("fingerprints.mm", …, shape=new_shape)

Meaning, fill the file with data before mapping the written data for repeated access. Of course all of this depends on your use-case. The above only really applies if the new appended data exists as a regular Numpy array at some point or in a similar form.
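The append-then-remap pattern above can be sketched as a self-contained example (the temporary path is illustrative; dtype and row width are taken from the question):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "fingerprints.mm")

arr = np.memmap(path, dtype="float32", mode="w+", shape=(1000, 256))
arr[:] = 1.0
arr.flush()

# Append new rows with ordinary sequential file IO...
new_data = np.full((500, 256), 2.0, dtype="float32")
with open(path, "ab") as fout:
    fout.write(new_data.tobytes())

# ...then re-map the grown file for random access.
new_shape = (arr.shape[0] + new_data.shape[0],) + arr.shape[1:]
arr = np.memmap(path, dtype="float32", mode="r+", shape=new_shape)
print(arr.shape)            # (1500, 256)
print(float(arr[1000, 0]))  # 2.0 -- first appended row
```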

Other questions

What happens if the following blocks in memory are already in use?

I assume you mean the disk memory. You will see the old data already in the file. If another process has the file opened, you will see each other's memory writes. If you don't want that, create a new file or truncate it. For example:

with open("fingerprints.mm", "w+b") as fout:
    arr = np.memmap(fout, …, shape=arr_shape)

"w+b" truncates any existing file. Be careful not to truncate a file that another process has mapped. Use other modes as appropriate, e.g. "x+b" to create a file only if it does not already exist. Or remove the old file and create a new one. Existing mappings to removed ("unlinked") files continue to work.
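A small sketch of the "x+b" variant (the temporary path is illustrative; np.memmap also accepts an already-open file object):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "fresh.mm")

# "x+b" raises FileExistsError if the file already exists, so stale
# data from an earlier run can never leak into the new array.
with open(path, "x+b") as fout:
    arr = np.memmap(fout, dtype="float32", mode="r+", shape=(4,))

print(arr.tolist())  # [0.0, 0.0, 0.0, 0.0] -- freshly zero-extended
```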

If you mean virtual memory when you say following blocks in memory, then this is no issue because the virtual address changes anyway. The OS will find a new, suitably large location. In theory you can run out of virtual memory, especially due to fragmentation. However, that is only a concern on 32 bit platforms.

The physical memory is not contiguous anyway. Pages are allocated as they are requested and the allocation uses whatever is available or can be reused with the least expected impact on other uses, e.g. using the least recently used page from the page cache. A new mapping changes nothing about this.



Does this method copy the whole memmap into a new one?

Let us analyse what exactly happens in your code. First, arr is mapped to the file "fingerprints.mm". This means the data pointer in arr's internal interface points to a virtual memory area mapped to the file. Touching a memory page for the first time will trigger reads/writes (further reads/writes may do the same, depending on the OS caching policy and the amount of memory available). When arr.flush() is performed, you know that everything has been written to the storage device.

When the second arr = np.memmap(...) is performed, a new virtual memory area is requested from the OS, with a new mapping to the same file. The OS implementation should map the virtual pages of the newly mapped area to the same physical memory pages as the old one (this is at least the case on Linux, and apparently on Windows too). That being said, virtual pages are generally not all mapped to these physical pages up front (i.e. when the np.memmap call is done), but lazily, when you read/write them for the first time on mainstream systems. Pages can be reloaded if there is not enough memory (see the page cache on Linux).

In your specific case, the second part of your code does not read the data written in the first part. It only writes to arr[arr_shape[0]:, :] and thus the associated memory pages. As a result, there is no reason for the OS to fetch the unrequested pages (i.e. the ones associated with arr[:arr_shape[0], :]); this is confirmed on both Linux and Windows.

Technically, please note that the file system may require blocks to be copied because of a lack of space after the file content area (but this is independent of memory mapping).

Thus, put shortly: here, memory pages should be neither copied nor fetched again from the storage device (at least from the user-land PoV).


Is this approach safe?

This assumes the file is not modified between the two operations. It also only works if data is appended at the end (you cannot increase the size of the last axis of a contiguous array, because the data was stored contiguously in the first place).
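The last-axis caveat is easy to demonstrate; a small sketch with illustrative sizes:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "axis.mm")

a = np.memmap(path, dtype="float32", mode="w+", shape=(2, 3))
a[:] = [[1, 2, 3], [4, 5, 6]]
a.flush()

# Growing the *last* axis reinterprets the flat bytes: the old rows
# shift across row boundaries instead of being padded individually.
b = np.memmap(path, dtype="float32", mode="r+", shape=(2, 4))
print(b.tolist())  # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 0.0, 0.0]]
```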


What happens if the following blocks in memory are already in use?

I am not sure I understand the question.

The data pointer of the first arr points to a virtual memory area which is not the same as that of the second one. If you are not convinced, you can check the value of arr.__array_interface__['data']. The associated physical memory pages may or may not be contiguous. In fact, the physical pages associated with the virtual pages of the first arr may also not be contiguous. The same applies to the storage device: the file may not be stored as one contiguous block. Appending may increase the fragmentation of the blocks on the storage device, depending on the target file system (as well as the OS and the drivers). Blocks may or may not be relocated dynamically when new data is written (even by the device itself, though the address will stay the same from the OS's PoV).

Moreover, it is the job of the OS to guarantee that multiple memory-mapped areas on the same file stay coherent. However, it is your job to ensure proper synchronisation between accesses to them (i.e. not to read/write from multiple threads or even processes without synchronisation between them). That being said, when the file is resized, I am not sure it is 100% guaranteed to be safe to access the old memory-mapped section after the mapped file has been resized with a new memmap, so I advise you not to do that. So far, experiments on Windows seem to indicate this is OK (the first and second arr stay coherent, and modifying data through one affects the other and the other way around), but absence of evidence is not evidence of absence.


ChatGPT proposes a solution with truncating the file, however, I cannot find any references for this approach:

I expect truncate to possibly invalidate memory mappings made before the truncation, especially if the new file size is smaller than the previous one. For appending new data, I expect truncate to be (up to twice) slower, because the file needs to be padded with zeros (and possibly flushed) prior to writing the new data into the file.

7 Comments

I think your understanding is wrong. "The OS implementation might be smart enough so virtual pages of the new mapping reference the same physical memory pages, but this is not guaranteed". No, it will work unless you map with copy-on-write (mode='c'). That's how memory-mapped IO always works. That's also how named shared memory regions between processes (shm_open) work. You'd break a ton of database systems if their processes couldn't share memory-mapped file pages.
I am not sure I understand your point. This is certainly guaranteed on Linux, but I was not sure about Windows at the time of writing that sentence. I have no idea whether an OS can implement this differently, but AFAIK that would be a legitimate implementation as long as it behaves "as if" pages were coherent. For example, a possible (very inefficient) implementation would be to copy data from one page to another and track memory accesses to ensure coherence (like some distributed shared memory systems do, AFAIK). Maybe I should have written "I do not know if this is guaranteed" instead. I will clarify this.
By the way, it looks like this is guaranteed on Windows based on the current Python implementation, but I failed to find strong evidence that this is the case in practice. Microsoft's debugging tools actually seem buggy: they fail to find some virtual pages of the mapping... I can just see that not all pages are mapped to physical ones, and that the range of physical pages is not contiguous at all (the mapping looks completely random to me). The amount of (physical) memory used does not seem to increase with the number of mappings, at least, which is a good clue.
Yes, Windows file mappings are coherent (see the Remarks section in the link) across multiple mappings, as long as they are not remote files. Creating a mapping also automatically grows the file if required, which is what we want.
Thanks for taking the time! "That being said, when the file is resized, I am not sure it is 100% guaranteed to access the old memory mapped section after the mapped file has been resized with a new memmap so I advise you not to do that." - What would be the best way in your opinion to do that safely then?
The POSIX standard for mmap leaves it open what happens with mappings and resizing (truncate calls). However, the concern is w.r.t. memory that either was mapped beyond the end of the file before expanding the file or that is mapped beyond the end after reducing the file size. Your type of usage is fine. One concern that Jérôme pointed out is that it is generally faster to fill the file with a write and then map it rather than using the mapping to fill the file.
You could just use only the new mapping and discard the previous one. This is what your current code already does: arr is set to the new np.memmap, so it references a new object and the old object is deleted (because there is no reference to it anymore). Please note that the virtual pages located at the beginning of the file must be mapped again by the OS (this is transparent to you, but slower). That being said, IO operations are more likely to be a bottleneck than paging.
