
For the following code, I only want to parallelize the last part, which computes the second (Euclidean) norm of each vector (the vectors have different lengths), but I am getting a segmentation fault. I am also not sure whether I am using the reduction for the sum in the right place.

Another point: I think I only need to parallelize the outer loop, and there is no need to parallelize the inner loop. Right?

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>
using namespace std;

int main()
{
    int N = 1000000;       
    unsigned size;
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0f, nextafter(1.0f, DBL_MAX));
    vector<double> b;
    
    for (int i = 0; i < N; i++) {
        size = pow(10, i % 4);
        vector<double> a;

        for (int j = 0; j < size; j++) {
            double number = d(g);
            a.push_back(number);
        }

        c.push_back(a);
    }       
    
    int i, j;
    double sum; 

    #pragma omp parallel for num_threads(4) shared(N) private(i,j,c,b) reduction (+: sum)
    for (int i = 0; i <N ; i++) {
       double sum = 0;
       for (int j = 0; j < c[i].size();j++) {
           sum = sum + pow(c[i][j],2);
       }

       double n = sqrt(sum);
       b.push_back(n);
    }
}

1 Answer

The segmentation fault is caused by the private clause, which does not copy the vectors: it initializes them to default-constructed empty vectors. Use firstprivate instead if you want to perform a copy from the "master" thread. That being said, c can simply be shared here.

Furthermore, here are several important points:

  • sum must be initialized to 0 (outside the loop);
  • the sum variable declared inside the parallel loop shadows the sum variable outside it (the same applies to i and j);
  • there is no need to declare a local sum for the reduction: OpenMP does it for you;
  • you can move a to avoid unnecessary copies, and reserve its size before filling it (faster);
  • N does not need to be shared between threads (it is better to give each thread a local copy);
  • since b is private, adding values to it is useless unless it is read locally in each thread (it depends on what you want to do). If you want to read b outside the parallel region, you either need to add a critical section, to merge the thread-local vector parts manually (faster), or to use direct indexed assignment (the simplest solution, and probably the fastest here).

Here is the corrected code:

#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <omp.h>
#include <fstream>
#include <cfloat>
#include <chrono>
using namespace std;

int main()
{
    const int N = 1000000;
    vector<vector<double>> c;
    default_random_engine g(0);
    uniform_real_distribution<double> d(0.0f, nextafter(1.0f, DBL_MAX));
    c.reserve(N);

    for (int i = 0; i < N; i++) {
        const unsigned size = pow(10, i % 4);
        vector<double> a;
        a.reserve(size);

        for (int j = 0; j < size; j++) {
            const double number = d(g);
            a.push_back(number);
        }

        c.push_back(std::move(a));
    }

    double sum = 0.0;
    vector<double> b(N);

    #pragma omp parallel num_threads(4) firstprivate(N) shared(b,c,sum)
    {
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N ; i++) {
            double sumLocal = 0.0;

            for (int j = 0; j < c[i].size();j++) {
                sumLocal += pow(c[i][j], 2);
            }

            const double n = sqrt(sumLocal);
            b[i] = n;        // indexed write: each iteration owns its own slot

            sum += sumLocal; // merged across threads by the reduction
        }
    }
}

6 Comments

@dreamcrash Actually, using firstprivate can be faster in this case because compilers often have issues with constant references (caused by OpenMP's shared): constant propagation fails since referenced values are not considered immutable, so the value is constantly reloaded into registers. In the worst case, this prevents vectorization of the code, resulting in lower performance. You can see this effect with Clang here.
@JérômeRichard Wow, that is pretty interesting, nice to know. Thanks for the link.
@JérômeRichard In the case that I want to parallelize the first part of the code, which generates the random numbers, I think I should also use firstprivate(N) and shared(a,c) there. Right?
@JérômeRichard I am getting a memory error related to the parallelization of the first part. Can you modify this code or should I ask a new question?
This is a bit more complex in this case, as random number engines are mostly sequential, so you cannot use one global engine here but rather one per thread, for example. Consequently, the results will differ from the sequential ones.
