void calc_mean(float *left_mean, float *right_mean,
               const uint8_t *left, const uint8_t *right,
               int32_t block_width, int32_t block_height,
               int32_t d, uint32_t w, uint32_t h,
               int32_t i, int32_t j)
{
*left_mean = 0;
*right_mean = 0;
int32_t i_b;
float local_left = 0, local_right = 0;

for (i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {
    #pragma omp parallel for reduction(+:local_left,local_right)
    for ( int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {
        // Borders checking
        if (!(i+i_b >= 0) || !(i+i_b < h) || !(j+j_b >= 0) || !(j+j_b < w) || !(j+j_b-d  >= 0) || !(j+j_b-d < w)) {
            continue;
        }
        // Calculating indices of the block within the whole image
        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);
        // Updating the block means
        //*left_mean += *(left+ind_l);
        //*right_mean += *(right+ind_r);
        local_left += left[ind_l];
        local_right += right[ind_r];
    }
}

*left_mean = local_left/(block_height * block_width);
*right_mean = local_right/(block_height * block_width);

}

This now makes the program execution longer than the non-threaded version. I added private(left,right), but that leads to bad memory accesses through ind_l.

3 Comments
  • What are the values of block_height and block_width? If the inner loop has just a handful of iterations, engaging OpenMP that way will not bring anything good. Commented Mar 13, 2018 at 10:57
  • The calling function sets them to 9 and 9. Commented Mar 13, 2018 at 12:37
  • I.e. you have just 81 simple iterations. A single call into most OpenMP runtimes to setup the parallel region will take longer than that, and you have 9 parallel regions (one for each iteration of the outer loop). You should definitely think of adding parallelism on a higher level then, e.g. somewhere in the function that calls calc_mean. Commented Mar 14, 2018 at 11:29
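Following that last comment, here is a hedged sketch of hoisting the parallelism into the caller. The pixel loop, the `compute_means` wrapper, and the output-array layout are illustrative assumptions, not the asker's actual calling code; the inner loop bounds also use `<=` so a 9x9 block really covers 9 rows and columns (the question's `<` visits one fewer than the divisor assumes).

```c
#include <stdint.h>

/* Serial block-mean: same logic as the question, with the OpenMP
   pragma removed so each call is cheap. Bounds use <= so the block
   spans block_height rows and block_width columns. */
static void calc_mean_serial(float *left_mean, float *right_mean,
                             const uint8_t *left, const uint8_t *right,
                             int32_t block_width, int32_t block_height,
                             int32_t d, uint32_t w, uint32_t h,
                             int32_t i, int32_t j)
{
    float sl = 0, sr = 0;
    for (int32_t i_b = -(block_height - 1) / 2; i_b <= (block_height - 1) / 2; i_b++) {
        for (int32_t j_b = -(block_width - 1) / 2; j_b <= (block_width - 1) / 2; j_b++) {
            /* Borders checking, as in the question */
            if (i + i_b < 0 || i + i_b >= (int32_t)h ||
                j + j_b < 0 || j + j_b >= (int32_t)w ||
                j + j_b - d < 0 || j + j_b - d >= (int32_t)w)
                continue;
            sl += left[(i + i_b) * w + (j + j_b)];
            sr += right[(i + i_b) * w + (j + j_b - d)];
        }
    }
    *left_mean = sl / (block_height * block_width);
    *right_mean = sr / (block_height * block_width);
}

/* Hypothetical caller: parallelize over image rows instead, so each
   thread handles many 9x9 blocks and the parallel region is entered
   only once rather than once per outer-loop iteration. */
void compute_means(float *lm, float *rm,
                   const uint8_t *left, const uint8_t *right,
                   int32_t bw, int32_t bh, int32_t d,
                   uint32_t w, uint32_t h)
{
    #pragma omp parallel for
    for (int32_t i = 0; i < (int32_t)h; i++)
        for (int32_t j = 0; j < (int32_t)w; j++)
            calc_mean_serial(&lm[i * w + j], &rm[i * w + j],
                             left, right, bw, bh, d, w, h, i, j);
}
```

Because each thread now does whole rows of blocks, the per-region setup cost is amortized over thousands of iterations instead of 81.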

1 Answer


I think this should get you closer to what you want, although I'm not quite sure about one final part.

float local_left = 0, local_right = 0;

for ( int32_t i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {

    // The reduction must be on the local accumulators (not the pointer
    // parameters), and the for loop must follow the pragma directly.
    #pragma omp parallel for schedule(static, CORES) reduction(+:local_left, local_right)
    for (int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {

        if (your conditions) continue;

        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);

        local_left += *(left+ind_l);
        local_right += *(right+ind_r);
    }
}

*left_mean = local_left/(block_height * block_width);
*right_mean = local_right/(block_height * block_width);

The part I am unsure of is whether you need the schedule() clause and how to combine two different reductions. I know that for a single reduction you can simply do

reduction(+:left_mean)
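On the two-reduction question: a single clause can list several variables separated by commas, and the reduction must target the scalar accumulators rather than the pointer parameters. A minimal compilable sketch (the function name and array contents are illustrative):

```c
#include <stdint.h>

/* Sum two arrays in one parallel loop; a single reduction clause
   covers both accumulators. */
void sum_both(const uint8_t *a, const uint8_t *b, int32_t n,
              float *sum_a, float *sum_b)
{
    float la = 0, lb = 0;
    #pragma omp parallel for reduction(+:la, lb)
    for (int32_t k = 0; k < n; k++) {
        la += a[k];
        lb += b[k];
    }
    *sum_a = la;
    *sum_b = lb;
}
```

Each thread gets private copies of la and lb, and OpenMP adds them into the shared copies when the region ends.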

EDIT: some reference for schedule(): http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html#Loopschedules It looks like you do not need it, but using it could produce a better runtime.


2 Comments

I don't see how your code differs from the one in the question. Static loop scheduling with small chunk size only introduces more overhead and *(A+b) is no faster than A[b].
I modified the question based on AndrewGrant's solution; I couldn't fit such a long comment in this box.
