void calc_mean(float *left_mean, float *right_mean,
               const uint8_t *left, const uint8_t *right,
               int32_t block_width, int32_t block_height,
               int32_t d, uint32_t w, uint32_t h,
               int32_t i, int32_t j)
{
*left_mean = 0;
*right_mean = 0;
int32_t i_b;
float local_left = 0, local_right = 0;

for (i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {
    #pragma omp parallel for reduction(+:local_left,local_right)
    for ( int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {
        // Borders checking
        if (!(i+i_b >= 0) || !(i+i_b < h) || !(j+j_b >= 0) || !(j+j_b < w) || !(j+j_b-d  >= 0) || !(j+j_b-d < w)) {
            continue;
        }
        // Calculating indices of the block within the whole image
        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);
        // Updating the block means
        //*left_mean += *(left+ind_l);
        //*right_mean += *(right+ind_r);
        local_left += left[ind_l];
        local_right += right[ind_r];
    }
}

*left_mean = local_left/(block_height * block_width);
*right_mean = local_right/(block_height * block_width);

}

This now makes the program execution longer than the non-threaded version. I added private(left,right), but that leads to bad memory accesses through ind_l.

3 Comments
  • What are the values of block_height and block_width? If the inner loop has just a handful of iterations, engaging OpenMP that way will not bring anything good. Commented Mar 13, 2018 at 10:57
  • The calling function sets them to 9 and 9. Commented Mar 13, 2018 at 12:37
  • I.e. you have just 81 simple iterations. A single call into most OpenMP runtimes to setup the parallel region will take longer than that, and you have 9 parallel regions (one for each iteration of the outer loop). You should definitely think of adding parallelism on a higher level then, e.g. somewhere in the function that calls calc_mean. Commented Mar 14, 2018 at 11:29
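Following that last comment, here is a hedged sketch of hoisting the parallelism into the caller. The pixel loop, the `compute_means` wrapper, and the output-array layout are illustrative assumptions, not the asker's actual calling code; the inner loop bounds also use `<=` so a 9x9 block really covers 9 rows and columns (the question's `<` visits one fewer than the divisor assumes).

```c
#include <stdint.h>

/* Serial block-mean: same logic as the question, with the OpenMP
   pragma removed so each call is cheap. Bounds use <= so the block
   spans block_height rows and block_width columns. */
static void calc_mean_serial(float *left_mean, float *right_mean,
                             const uint8_t *left, const uint8_t *right,
                             int32_t block_width, int32_t block_height,
                             int32_t d, uint32_t w, uint32_t h,
                             int32_t i, int32_t j)
{
    float sl = 0, sr = 0;
    for (int32_t i_b = -(block_height - 1) / 2; i_b <= (block_height - 1) / 2; i_b++) {
        for (int32_t j_b = -(block_width - 1) / 2; j_b <= (block_width - 1) / 2; j_b++) {
            /* Borders checking, as in the question */
            if (i + i_b < 0 || i + i_b >= (int32_t)h ||
                j + j_b < 0 || j + j_b >= (int32_t)w ||
                j + j_b - d < 0 || j + j_b - d >= (int32_t)w)
                continue;
            sl += left[(i + i_b) * w + (j + j_b)];
            sr += right[(i + i_b) * w + (j + j_b - d)];
        }
    }
    *left_mean = sl / (block_height * block_width);
    *right_mean = sr / (block_height * block_width);
}

/* Hypothetical caller: parallelize over image rows instead, so each
   thread handles many 9x9 blocks and the parallel region is entered
   only once rather than once per outer-loop iteration. */
void compute_means(float *lm, float *rm,
                   const uint8_t *left, const uint8_t *right,
                   int32_t bw, int32_t bh, int32_t d,
                   uint32_t w, uint32_t h)
{
    #pragma omp parallel for
    for (int32_t i = 0; i < (int32_t)h; i++)
        for (int32_t j = 0; j < (int32_t)w; j++)
            calc_mean_serial(&lm[i * w + j], &rm[i * w + j],
                             left, right, bw, bh, d, w, h, i, j);
}
```

Because each thread now does whole rows of blocks, the per-region setup cost is amortized over thousands of iterations instead of 81.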

1 Answer


I think this should get you closer to what you want, although I'm not quite sure about one final part.

float local_left = 0, local_right = 0;

for ( int32_t i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {

    // The reduction must be on the local accumulators (not the pointer
    // parameters), and the for loop must follow the pragma directly.
    #pragma omp parallel for schedule(static, CORES) reduction(+:local_left, local_right)
    for (int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {

        if (your conditions) continue;

        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);

        local_left += *(left+ind_l);
        local_right += *(right+ind_r);
    }
}

*left_mean = local_left/(block_height * block_width);
*right_mean = local_right/(block_height * block_width);

The part I am unsure of is whether you need the schedule() clause and how to combine two different reductions. I know that for a single reduction you can simply do

reduction(+:left_mean)
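On the two-reduction question: a single clause can list several variables separated by commas, and the reduction must target the scalar accumulators rather than the pointer parameters. A minimal compilable sketch (the function name and array contents are illustrative):

```c
#include <stdint.h>

/* Sum two arrays in one parallel loop; a single reduction clause
   covers both accumulators. */
void sum_both(const uint8_t *a, const uint8_t *b, int32_t n,
              float *sum_a, float *sum_b)
{
    float la = 0, lb = 0;
    #pragma omp parallel for reduction(+:la, lb)
    for (int32_t k = 0; k < n; k++) {
        la += a[k];
        lb += b[k];
    }
    *sum_a = la;
    *sum_b = lb;
}
```

Each thread gets private copies of la and lb, and OpenMP adds them into the shared copies when the region ends.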

EDIT: some reference for schedule(): http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html#Loopschedules It looks like you do not need it, but using it could produce a better runtime.


2 Comments

I don't see how your code differs from the one in the question. Static loop scheduling with small chunk size only introduces more overhead and *(A+b) is no faster than A[b].
I modified the question based on AndrewGrant's solution; I couldn't fit such a long comment in this box.
