I am trying to run the following example MPI code, which launches 20 threads and keeps them busy for a while. However, when I check CPU utilization with a tool like nmon or top, I see that only a single hardware thread is being used.
#include <cstdlib>   // exit
#include <iostream>
#include <thread>
#include <mpi.h>
using namespace std;

int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // Thread levels are ordered, so a higher level than requested is also fine.
    if (provided < MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // CPU-bound dummy workload; the result is printed so the loop is not optimized away.
    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result += 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; ++i)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code with mpicxx (mpicxx -std=c++11 -pthread example.cpp -o example) and run it with mpirun (mpirun -np 1 example).
I am using Open MPI version 4.1.4, which is compiled with POSIX thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 hardware threads, and the same example code without MPI runs on all 20 of them. So why does the version with MPI not run on all hardware threads?
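To make the observation less dependent on nmon or top, the threads themselves could report where they ran. The following is only a minimal sketch, assuming Linux with glibc (sched_getcpu is not portable); the helper name distinct_cpus_used and its parameters are made up for illustration and are not part of my actual program.

```cpp
#include <sched.h>   // sched_getcpu (Linux/glibc specific, an assumption of this sketch)
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: run `n` busy threads and return how many distinct
// logical CPUs they were observed on when their work finished.
int distinct_cpus_used(int n, long iterations) {
    std::vector<int> cpu(n, -1);                  // one slot per thread, no sharing
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t) {
        workers.emplace_back([&cpu, t, iterations] {
            volatile double sink = 0;             // keeps the loop from being optimized away
            for (long i = 0; i < iterations; ++i) sink = sink + i;
            cpu[t] = sched_getcpu();              // logical CPU at this moment, -1 on error
        });
    }
    for (auto& w : workers) w.join();
    return static_cast<int>(std::set<int>(cpu.begin(), cpu.end()).size());
}
```

Calling distinct_cpus_used(20, 100000000) once in a plain build and once after MPI_Init_thread under mpirun should show whether the threads are really confined to fewer CPUs in the MPI case. The value is only a snapshot, since the scheduler may migrate threads while they run.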
I suspect I might need to change MPI's process binding options, which I see mentioned in some answers on the same topic (1, 2), but other answers leave these options out entirely, so I'm unsure whether this is the correct approach.
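One way to test the binding hypothesis would be to read the affinity mask of the process and compare it with the number of hardware threads. This is a rough sketch under the assumption of Linux with glibc (sched_getaffinity and CPU_COUNT are not portable); allowed_cpu_count and looks_bound are hypothetical helper names, not Open MPI functions.

```cpp
#include <sched.h>   // sched_getaffinity, CPU_ZERO, CPU_COUNT (Linux/glibc specific)
#include <thread>

// Hypothetical helper: number of logical CPUs the calling thread is allowed
// to run on, or -1 if the affinity mask cannot be read.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;  // pid 0 = caller
    return CPU_COUNT(&mask);
}

// Hypothetical helper: true when the caller is restricted to fewer logical
// CPUs than the machine reports.
bool looks_bound() {
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    int allowed = allowed_cpu_count();
    return hw != 0 && allowed > 0 && static_cast<unsigned>(allowed) < hw;
}
```

If allowed_cpu_count(), called in main before the threads are created (new threads inherit the mask), returns 20 when the binary is started directly but a smaller number under mpirun -np 1, that would point to binding as the cause. In that case Open MPI's mpirun options --report-bindings (to print the binding) and --bind-to none (to disable it) look like the relevant ones to try, though I have not confirmed that they resolve this case.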