1

I am trying to add multi-threading in a C++ code. The target is the for loop inside the function. The objective is to reduce the execution time of the program. It takes 3.83 seconds for execution.

I have tried to add the command #pragma omp parallel for reduction(+:sum) in the inner loop (before the j for-loop) but it was not enough. It took 1.98 seconds. The aim is to decrease the time up to 0.5 seconds.

I made some research to increase the speed up and some people recommend the Strip Mining method for Vectorization for better results. However I do not know how to implement it yet.

Could someone know how to do it ?

The code is:

void filter(const long n, const long m, float *data, const float threshold, std::vector &result_row_ind) {

  for (long i = 0; i < n; i++) {
    float sum = 0.0f;
    for (long j = 0; j < m; j++) {
      sum += data[i*m + j];
    } 
    if (sum > threshold) 
      result_row_ind.push_back(i);
  }
  std::sort(result_row_ind.begin(),
            result_row_ind.end());
}

Thank you very much

4
  • 1
    What are the values for n, m and result_row_ind.size() when it takes 3.83 seconds? Commented Jan 10, 2019 at 17:42
  • First, take i*m outside of the inner loop. Second, you could parallelize this by segmenting the outer loop and passing each segment to a different thread. You then sum the results when the threads return. What version of C++ is this? Commented Jan 10, 2019 at 17:48
  • Hello Ted, n is the number of datasets, m is the dataset size, and result_row_ind is the output. They are define as: const long n = 1<<15; (rows) const long m = 1<<18; (columns). Commented Jan 10, 2019 at 22:18
  • Hello Thomas, I have already put 'i*m' outside of the inner loop and ran the code. It decreased to 1.02 seconds. (I am still considering the #pragma reduction that I mentioned before). Now I need to start the parallelize procedure that you suggest. The c++ is the 11. Commented Jan 10, 2019 at 22:30

2 Answers 2

3

When possible, you likely want to parallelize the outer loop. The simplest way to go about this in OpenMP is to do this:

#pragma omp parallel for
for (long i = 0; i < n; i++) {
  float sum = 0.0f;
  for (long j = 0; j < m; j++) {
    sum += data[i*m + j];
  }
  if (sum > threshold) {
    #pragma omp critical
    result_row_ind.push_back(i);
  }
}
std::sort(result_row_ind.begin(),
          result_row_ind.end());

This works, and is probably a great deal faster than parallelizing the inner loop (launching a parallel region is expensive), but it uses a critical section for locking to prevent races. The race could also be avoided by using a user defined reduction over vectors with a reduction on that loop, if the number of threads is very large and the number of matching results is very small this might be slower, but otherwise it is likely notably faster. This is not quite right, the vector type is incomplete since it wasn't listed, but should be pretty close:

#pragma omp declare \
  reduction(CatVec: std::vector<T>: \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end())) \
  initializer(omp_priv=std::vector<T>())

#pragma omp parallel for reduction(CatVec: result_row_ind)
for (long i = 0; i < n; i++) {
  float sum = 0.0f;
  for (long j = 0; j < m; j++) {
    sum += data[i*m + j];
  }
  if (sum > threshold) {
    result_row_ind.push_back(i);
  }
}
std::sort(result_row_ind.begin(),
          result_row_ind.end());
Sign up to request clarification or add additional context in comments.

Comments

0

If you have a C++ compiler with support for execution policies, you could try std::for_each with the execution policy std::execution::par to see if that helps. Example:

#include <iostream>
#include <vector>
#include <algorithm>
#if __has_include(<execution>)
# include <execution>
#elif __has_include(<experimental/execution_policy>)
# include <experimental/execution_policy>
#endif

// iterator to use with std::for_each
class iterator {
    size_t val;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = size_t;
    using difference_type = size_t;
    using pointer = size_t*;
    using reference = size_t&;

    iterator(size_t value=0) : val(value) {}
    inline iterator& operator++() { ++val; return *this; }
    inline bool operator!=(const iterator& rhs) const { return val != rhs.val; }
    inline reference operator*() { return val; }
};

std::vector<size_t> filter(const size_t rows, const size_t cols, const float* data, const float threshold) {
    std::vector<size_t> result_row_ind;
    std::vector<float> sums(rows);

    iterator begin(0);
    iterator end(rows);

    std::for_each(std::execution::par, begin, end, [&](const size_t& row) {
        const float* dataend = data + (row+1) * cols;
        float& sum = sums[row];

        for (const float* dataptr = data + row * cols; dataptr < dataend; ++dataptr) {
            sum += *dataptr;
        }
    });

    // pushing moved outside the threaded code to avoid using mutexes
    for (size_t row = 0; row < rows; ++row) {
        if (sums[row] > threshold)
            result_row_ind.push_back(row);
    }
    std::sort(result_row_ind.begin(),
        result_row_ind.end());

    return result_row_ind;
}

int main() {
    constexpr size_t rows =  1<<15, cols = 1<<18;
    float* data = new float[rows*cols];

    for (int i = 0; i < rows*cols; ++i) data[i] = (float)i / (float)100000000.;

    std::vector<size_t> res = filter(rows, cols, data, 10.);
    std::cout << res.size() << "\n";

    delete[] data;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.