
I am new to Eigen and am writing some simple code to test its performance. I am using a MacBook Pro with an M1 Pro chip (I do not know whether the ARM architecture is related to the problem). The code is a simple Laplace equation solver:

#include <iostream>
#include "mpi.h"
#include "Eigen/Dense"
#include <chrono>
 
using namespace Eigen;
using namespace std;

const size_t num = 1000UL;

MatrixXd initialize(){
    MatrixXd u = MatrixXd::Zero(num, num);
    u(seq(1, fix<num-2>), seq(1, fix<num-2>)).setConstant(10);
    return u;
}

void laplace(MatrixXd &u){
    setNbThreads(8);
    MatrixXd u_old = u;

    u(seq(1,last-1),seq(1,last-1)) =
    ((  u_old(seq(0,last-2,fix<1>),seq(1,last-1,fix<1>)) + u_old(seq(2,last,fix<1>),seq(1,last-1,fix<1>)) +
        u_old(seq(1,last-1,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(1,last-1,fix<1>),seq(2,last,fix<1>)) )*4.0 +
        u_old(seq(0,last-2,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(0,last-2,fix<1>),seq(2,last,fix<1>)) +
        u_old(seq(2,last,fix<1>),seq(0,last-2,fix<1>))   + u_old(seq(2,last,fix<1>),seq(2,last,fix<1>)) ) /20.0;
}


int main(int argc, const char * argv[]) {
    initParallel();
    setNbThreads(0);
    cout << nbThreads() << endl;
    MatrixXd u = initialize();
    
    auto start  = std::chrono::high_resolution_clock::now();
    
    for (auto i=0UL; i<100; i++) {
        laplace(u);
    }
    
    auto stop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    
    // cout << u(seq(0, fix<10>), seq(0, fix<10>)) << endl;
    cout << "Execution time (ms): " << duration.count() << endl;
    return 0;
}

Compile with gcc and enable OpenMPI

james@MBP14 tests % g++-11 -fopenmp  -O3 -I/usr/local/include -I/opt/homebrew/Cellar/open-mpi/4.1.3/include -o test4 test.cpp

Run the binary directly:

james@MBP14 tests % ./test4
8
Execution time (ms): 273

Run with mpirun, specifying 8 threads:

james@MBP14 tests % mpirun -np 8 test4
8
8
8
8
8
8
8
8
Execution time (ms): 348
Execution time (ms): 347
Execution time (ms): 353
Execution time (ms): 356
Execution time (ms): 350
Execution time (ms): 353
Execution time (ms): 357
Execution time (ms): 355

So obviously the matrix operation is not running in parallel; instead, every thread is running the same copy of the code.

What should be done to solve this problem? Do I have some misunderstanding about using OpenMPI?

1 Answer


You are confusing OpenMPI with OpenMP.

  • The gcc flag -fopenmp enables OpenMP. It is one way to parallelize an application by using special #pragma omp statements in the code. The parallelization happens on a single CPU (or, to be precise, a single compute node, in case the compute node has multiple CPUs). This allows you to employ all cores of that CPU. OpenMP cannot be used to parallelize an application over multiple compute nodes. (A minimal sketch follows this list.)
  • On the other hand, MPI (where OpenMPI is one particular implementation) can be used to parallelize a code over multiple compute nodes (i.e., roughly speaking, over multiple computers that are connected). It can also be used to parallelize some code over multiple cores on a single computer. So MPI is more general, but also much more difficult to use.
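To make the OpenMP side concrete, here is a minimal sketch (my own illustration, not the question's code) of a loop parallelized with a #pragma omp statement; with -fopenmp the compiler splits the iterations across the cores of a single machine:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // The pragma is all that is needed: when compiled with -fopenmp,
    // the iterations of this loop are divided among the available threads.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("max threads: %d\n", omp_get_max_threads());
    return 0;
}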

To use MPI, you need to call "special" functions and do the hard work of distributing data yourself. If you do not do this, calling an application with mpirun simply creates several identical processes (not threads!) that perform exactly the same computation. You have not parallelized your application; you have just executed it 8 times. A minimal sketch of what MPI code looks like is given below.
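For contrast, here is a minimal MPI sketch (again my own illustration, not a fix for the question's solver) in which each process figures out its rank, works on its own slice of the data, and the partial results are combined explicitly:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes in total?

    const int n = 1000;                     // total amount of work
    const int chunk = n / size;             // each process handles only its own chunk
    const int begin = rank * chunk;
    const int end   = (rank == size - 1) ? n : begin + chunk;

    double local_sum = 0.0;
    for (int i = begin; i < end; ++i)
        local_sum += i;                     // each rank computes a partial result

    double global_sum = 0.0;                // combine the partial results on rank 0
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}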

There are no compiler flags that enable MPI. MPI is not built into any compiler. Rather, MPI is a standard, and OpenMPI is one specific library that implements that standard. You should read a tutorial or book about MPI and OpenMPI.

Note: Usually, MPI libraries such as OpenMPI ship with executables/scripts (e.g. mpicc) that behave like compilers. But they are just thin wrappers around compilers such as gcc. These wrappers automatically tell the actual compiler which include directories to use and which libraries to link against. But again, the compilers themselves do not know anything about MPI.


5 Comments

"To use MPI, you need to call "special" functions ". Yes, but well-written libraries hide all that from you and allow you to write what almost looks like sequential code, where MPI is done completely in the library calls. Eigen is not such a library.
Thanks for your detailed explanation. For OpenMP, let's say I have 8 cores: does it use all of them automatically? I found that setNbThreads(8) and setNbThreads(4) gave the same performance. What's more interesting, even if I don't call setNbThreads manually and use Apple Clang without -fopenmp (it doesn't support it anyway), the performance is roughly the same. Does that mean even my OpenMP is not set up correctly?
If you run your program with mpirun, I think OpenMPI sets the default number of OpenMP threads to 1 or something like that, if I remember correctly. (Combining OpenMPI and OpenMP requires real extra care.) Running your program without mpirun does use all cores by default; calling setNbThreads() is not necessary for this. BUT: not all algorithms in Eigen are parallelized! The ones that are parallelized are listed in Eigen's documentation on multi-threading. Notably, matrix-matrix products are parallelized, but you are not using those, as far as I can see.
Thanks a lot! I only have matrix additions like A+B, which may not have a built-in parallel implementation. This essentially means I need to switch to another library if I want this Laplace solver to run in parallel.
Yes, either you need to change the library or parallelize it yourself.
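For reference, the "parallelize it yourself" option mentioned above could look roughly like the following sketch, which replaces the block expression from the question with an explicit loop over rows and an OpenMP pragma (the function name laplace_omp and the loop structure are my own, not from the question or from Eigen):

#include <Eigen/Dense>

// Hand-parallelized Jacobi update: the same 9-point stencil as in the question,
// written as an explicit loop so OpenMP can split the rows across threads.
// Compile with -fopenmp; without it the pragma is simply ignored.
void laplace_omp(Eigen::MatrixXd &u, Eigen::MatrixXd &u_old) {
    u_old = u;                                   // keep a copy of the previous iterate
    const Eigen::Index rows = u.rows();
    const Eigen::Index cols = u.cols();

    #pragma omp parallel for
    for (Eigen::Index i = 1; i < rows - 1; ++i) {    // rows are divided among threads
        for (Eigen::Index j = 1; j < cols - 1; ++j) {
            u(i, j) = (4.0 * (u_old(i - 1, j) + u_old(i + 1, j) +
                              u_old(i, j - 1) + u_old(i, j + 1)) +
                       u_old(i - 1, j - 1) + u_old(i - 1, j + 1) +
                       u_old(i + 1, j - 1) + u_old(i + 1, j + 1)) / 20.0;
        }
    }
}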
