What causes increasing memory consumption in OpenMP-based simulation?

Question

The problem

I am struggling with memory consumption in my Monte Carlo particle simulation, which uses OpenMP for parallelization. Without going into the details of the simulation method: one parallel part performs "particle moves" using some number of threads, and the other performs "scaling moves" using some, possibly different, number of threads. These two parallel sections run alternately, separated by some serial code, and each takes milliseconds to run.

I have an 8-core, 16-thread machine running Ubuntu 18.04 LTS, and I am using GCC with the GNU OpenMP implementation. Now:

using 8 threads for "particle moves" and 8 threads for "scaling moves" yields a stable 8-9 MB memory usage
using 8 threads for "particle moves" and 16 threads for "scaling moves" causes memory consumption to grow from those 8 MB to tens of GB in a long simulation, eventually ending in an OOM kill
using 16 threads and 16 threads is OK
using 16 threads and 8 threads causes increasing consumption

So something goes wrong when the numbers of threads for those two types of moves don't match.

~~Unfortunately, I was not able to reproduce the issue in a minimal example and I can only give a summary of the OpenMP code.~~ A link to a minimal example is at the bottom.

In the simulation I have N particles with some positions. "Particle moves" are organized in a grid, and I am using collapse(3) to distribute the grid cells among threads. The code looks more or less like this:

// Each thread has its own cell in a 2 x 2 x 2 grid
#pragma omp parallel for collapse(3) num_threads(8 or 16)
for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t j = 0; j < 2; j++) {
        for (std::size_t k = 0; k < 2; k++) {
            std::array<std::size_t, 3> gridCoords = {i, j, k};
            
            // This does something for all particles in {i, j, k} grid cell
            doIndependentParticleMovesInAGridCellGivenByCoords(gridCoords);
        }
    }
}

(Note that there are only 8 grid cells to distribute in both cases, 8 and 16 threads, but having those additional 8 idle threads magically fixes the problem when 16 scaling threads are used.)

In "scaling moves" (volume moves) I do an overlap check on each particle independently and skip the remaining checks once the first overlap is found. It looks like this:

// We independently check for each particle
std::atomic<bool> overlapFound{false};  // brace-init: copy-initialization of std::atomic is ill-formed before C++17
#pragma omp parallel for num_threads(8 or 16)
for (std::size_t i = 0; i < N; i++) {
    if (overlapFound)
        continue;
    if (isParticleOverlappingAnything(i))
        overlapFound = true;
}
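For readers who want to run the early-exit pattern above, here is a minimal, self-contained sketch. The 1D unit-diameter "particles" and the body of isParticleOverlappingAnything are hypothetical stand-ins for the real overlap test, and anyOverlap with its numThreads parameter is a name made up for this illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real overlap test: 1D particles of unit
// diameter overlap when their centres are closer than 1.0.
bool isParticleOverlappingAnything(const std::vector<double> &positions, std::size_t i) {
    for (std::size_t j = 0; j < positions.size(); j++) {
        if (j == i)
            continue;
        double distance = positions[i] - positions[j];
        if (distance < 0)
            distance = -distance;
        if (distance < 1.0)
            return true;
    }
    return false;
}

// Checks every particle independently; once any thread finds an overlap,
// the remaining iterations only read the flag and skip the expensive test.
bool anyOverlap(const std::vector<double> &positions, int numThreads) {
    std::atomic<bool> overlapFound{false};
    #pragma omp parallel for num_threads(numThreads)
    for (std::size_t i = 0; i < positions.size(); i++) {
        if (overlapFound.load())
            continue;
        if (isParticleOverlappingAnything(positions, i))
            overlapFound.store(true);
    }
    return overlapFound.load();
}
```

The loop cannot break out early because OpenMP worksharing loops do not allow it, so the flag only makes the leftover iterations cheap. Compiled without -fopenmp the pragma is ignored and the function runs serially with the same result.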

Now, in parallel regions I don't allocate any new memory and don't need any critical sections - there should be no race conditions.

Moreover, all memory management in the whole program is done in a RAII fashion by std::vector, std::unique_ptr, etc. - I don't use new or delete anywhere.

Investigation

I tried some Valgrind tools. I ran the simulation for a while, long enough to reach about 16 MB of (still increasing) memory consumption in the non-matching thread-count case, while it stays at 8 MB in the matching case.

Valgrind Memcheck does not show any memory leaks (only a couple of kB "still reachable" or "possibly lost" from OpenMP control structures, see here) in either case.
Valgrind Massif reports only those "correct" 8 MB of allocated memory in both cases.

I also tried to surround the contents of main in { } and add while(true):

int main() {
    {
        // Do the simulation and let RAII do all the cleanup when destructors are called
    }

    // Hang
    while(true) { }
}

During the simulation, memory consumption increases to, let's say, 100 MB. When the { ... } block finishes, memory consumption drops by around 6 MB and stays at 94 MB in while(true). 6 MB is the actual size of the biggest data structures (I estimated it), but the origin of the remaining part is unknown.
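The numbers above can also be sampled from inside the program rather than from an external monitor. The following is only a sketch, assuming a Linux system where /proc/self/status exposes a VmRSS line; the helper name currentRssKb is made up for this illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns the resident set size of the current process in kB, or -1 if
// /proc/self/status cannot be read or has no VmRSS line (e.g. non-Linux).
long currentRssKb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // The line of interest looks like "VmRSS:      1234 kB"
        if (line.rfind("VmRSS:", 0) == 0) {
            std::istringstream in(line.substr(6));
            long kb = -1;
            in >> kb;
            return kb;
        }
    }
    return -1;
}
```

Logging currentRssKb() before the simulation, at the end of the { ... } block, and once inside the final loop would give the same three data points described above without relying on a separate tool.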

Hypothesis

So I assume it must be something with OpenMP memory management. Maybe using 8 and 16 threads alternately causes OpenMP to constantly create new thread pools, abandoning the old ones without releasing resources? I found something like this here, but it seems to concern another OpenMP implementation.

I would be very grateful for ideas about what else I can check and where the issue might be.

re @1201ProgramAlarm: I have changed volatile to std::atomic
re @Gilles: I have checked 16 threads case for "particle moves" and updated accordingly

Minimal example

I was finally able to reproduce the issue in a minimal example. It ended up being extremely simple, and all the details here are unnecessary. I created a new question without all the mess here.

Comments

Is it possible you have a data structure that isn't being reset/cleared? Possibly a race condition, which is more likely with more threads. Do you use thread_local variables? Also overlapFound should be std::atomic<bool>. Declaring it volatile is not sufficient.
– 1201ProgramAlarm, Commented Apr 16, 2021 at 20:06
@1201ProgramAlarm I can check for race conditions using Helgrind, I will update the question. I thought it might be an uncleared structure getting bigger and bigger, but it should be reclaimed by destructors at the end of { ... } anyway, and it should be detected by Massif. Or maybe there is no problem when run inside Valgrind, but there is one when run standalone? Is that possible? I didn't use thread_local. And could you elaborate on why I should use std::atomic?
– PKua, Commented Apr 16, 2021 at 20:40
See this question for why you should use atomic instead of volatile. (Essentially, volatile may work properly, or appear to work properly, on some systems, but for multithreading like this it is not guaranteed to be thread safe by the language standard.)
– 1201ProgramAlarm, Commented Apr 16, 2021 at 21:46
Thank you for the explanation, it's always good to avoid undefined behavior. I have corrected the code in the question accordingly. The problem still persists, however.
– PKua, Commented Apr 17, 2021 at 0:14
Providing a minimal working example would help us a lot to track and reproduce the problem. I think the number of possible issues is too large for anyone to really help you.
– Jérôme Richard, Commented Apr 17, 2021 at 12:51

PKua · Accepted Answer · 2021-04-29 18:01:20Z

Where lies the problem?

It seems that the problem is not connected with what this particular code does or how the OpenMP clauses are structured, but solely with two alternating OpenMP parallel regions that use different numbers of threads. After millions of those alternations, the process uses a substantial amount of memory regardless of what is inside the regions. They may even be as simple as sleeping for a couple of milliseconds.

As this question contains too many unnecessary details, I have moved the discussion to a more direct question here. I refer the interested reader there.

A brief summary of what happens

Here I give a brief summary of what Stack Overflow members and I were able to determine. Let's say we have two OpenMP parallel regions with different numbers of threads, such as here:

#include <unistd.h>

int main() {
    while (true) {
        #pragma omp parallel num_threads(16)
        usleep(30);

        #pragma omp parallel num_threads(8)
        usleep(30);
    }
    return 0;
}

As described in more detail here, OpenMP reuses the 8 threads common to both regions, but the other 8 needed for the 16-thread region are constantly created and destroyed. This constant thread creation causes increasing memory consumption, either because of an actual memory leak or because of memory fragmentation; I don't know which. Moreover, the problem seems to be specific to GOMP, the OpenMP implementation in GCC (up to at least version 10). The Clang and Intel compilers do not seem to reproduce the issue.
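One way to check whether threads really are being re-created is to count the distinct kernel thread IDs seen over many alternations. This is only a sketch, assuming Linux (it uses the raw gettid system call); the function name countDistinctThreads and its parameters are made up for this illustration.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <sys/syscall.h>
#include <unistd.h>

// Alternates two parallel regions with the given team sizes and returns how
// many distinct kernel thread IDs took part in them overall.
std::size_t countDistinctThreads(int rounds, int threadsA, int threadsB) {
    std::set<long> tids;
    std::mutex tidsMutex;

    // Every thread of a team records its own kernel TID (Linux-specific)
    auto record = [&] {
        long tid = syscall(SYS_gettid);
        std::lock_guard<std::mutex> lock(tidsMutex);
        tids.insert(tid);
    };

    for (int round = 0; round < rounds; round++) {
        #pragma omp parallel num_threads(threadsA)
        record();

        #pragma omp parallel num_threads(threadsB)
        record();
    }
    return tids.size();
}
```

If the explanation above holds, a call such as countDistinctThreads(1000, 16, 8) should report many more IDs than countDistinctThreads(1000, 16, 16) under GOMP, while a runtime that keeps its pool should report roughly the team size in both cases. The kernel can eventually reuse thread IDs, so the result is best read as a lower bound.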

Although the OpenMP standard does not state it explicitly, most implementations tend to reuse already spawned threads, but this seems not to be the case for GOMP, and it is probably a bug. I will file a bug report and update the answer. For now, the only workaround is to use the same number of threads in every parallel region (then GOMP properly reuses the old threads). In cases like the collapse loop from the question, where there are fewer work items to distribute than there are threads in the other region, one can always request 16 threads instead of 8 and let the other 8 do nothing. It worked quite well in my "production" code.
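The workaround can be sketched as follows. This is only an illustration under assumptions: kTeamSize, doMovesInCell, particleMoves and scalingMoves are hypothetical names, and the per-cell and per-particle work is replaced by simple counters so that the result is easy to check.

```cpp
#include <array>
#include <cstddef>

// Assumed: the larger of the two thread counts, used for every parallel region
constexpr int kTeamSize = 16;

// Hypothetical per-cell work: it only marks the cell as visited
void doMovesInCell(std::array<int, 8> &visited, std::size_t i, std::size_t j, std::size_t k) {
    visited[4 * i + 2 * j + k] += 1;
}

// Only 8 iterations exist, so with 16 threads half of the team gets no work,
// but the team size matches the one used in scalingMoves
int particleMoves() {
    std::array<int, 8> visited{};
    #pragma omp parallel for collapse(3) num_threads(kTeamSize)
    for (std::size_t i = 0; i < 2; i++)
        for (std::size_t j = 0; j < 2; j++)
            for (std::size_t k = 0; k < 2; k++)
                doMovesInCell(visited, i, j, k);

    int total = 0;
    for (int count : visited)
        total += count;
    return total;
}

// Stand-in for the per-particle region; it uses the same team size
int scalingMoves(int numParticles) {
    int checked = 0;
    #pragma omp parallel for num_threads(kTeamSize) reduction(+ : checked)
    for (int i = 0; i < numParticles; i++)
        checked += 1;
    return checked;
}
```

Keeping the thread count in a single constant makes it harder for the two regions to drift apart again later.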
