How to optimize this OpenCL kernel?

Question

I'm working on a project and I've got some problems with this OpenCL kernel :-(

__kernel void gemm_fast_5(
    __global double *ar, __global double *br, __global double *cr,
    __global double *pr, __global double *ur,

    unsigned long c, unsigned long c2,
    unsigned long c3, unsigned long c4,
    unsigned long c5, unsigned long m,
    unsigned char com
){
    unsigned long i = get_global_id(0);
    unsigned long j = get_global_id(1);

    unsigned long x = get_local_id(0);
    unsigned long y = get_local_id(1);

    unsigned long cur = i*c3 + j, rl, rl2, rl3;

    #if ks == 1 || ks == 2 || ks == 3 || ks == 4
    unsigned long rl4;
    #endif


    #if ks == 2
    rl = (i << 1)*c;
    #elif ks == 3
    rl = ((i << 1) + 1)*c;
    #else
    rl = i*c;
    #endif

    __local double ut, pt;

    if (x == 0) pt = pr[i*c4 + ks];
    if (y == 0) ut = ur[j*c5 + ks];

    double aa = 0.0;

    double bb, cc;
    double dd, ee;

    for (unsigned long k=0; k<m; k++){
        #if ks == 1 || ks == 4
        rl3 = (k << 1) + 1; rl4 = (k << 2) + 3;

        bb = ar[rl + rl3 - 1]; cc = ar[rl + rl3];
        dd = br[rl2 + rl4 - 1]; ee = br[rl2 + rl4 - 3];
        #elif ks == 2 || ks == 3
        rl3 = (k << 2) + 3; rl4 = (k << 1) + 1;

        bb = ar[rl + rl3 - 3]; cc = ar[rl + rl3 - 2];
        dd = br[rl2 + rl4]; ee = br[rl2 + rl4 - 1];
        #else
        rl3 = (k << 1) + 1;

        bb = ar[rl + rl3 - 1]; cc = ar[rl + rl3];
        dd = br[rl2 + rl3]; ee = br[rl2 + rl3 - 1];
        #endif

        aa += (bb + dd)*(cc + ee);
    }
    cr[cur] = aa - pt - ut;
}

While working, I noticed that if I delete the last line, the kernel takes 6 times less to run even changing the last line with cr[cur] = 5.0 - pt - ut; for example.

Shouldn't it take the same, or something similar at least? Even looking for an answer, taking advantage of the fact that I have CPU and GPU, I have tried in several runtime (PoCL and opencl-amd) and the same thing happens :-/

I would be very grateful if someone would give me a hand in making me understand why this happens. I don't understand :"v

Marco Bonelli · Accepted Answer · 2020-11-07 03:22:27Z

1

All the operations inside the loop do not have side effects, you only read from those __global pointers, and you calculate some temporary values that in the end get accumulated into aa through that final aa += .... In other words, the sole purpose of that loop is to calculate the value of aa.

Therefore, if you remove aa from the last line (outside the loop), all the operations inside the loop are completely useless, and you end up with a loop that does nothing except reading some values and updating local variables that will get discarded at function return. Compiling the above code with optimizations enabled (which I assume you are doing, otherwise your question wouldn't make much sense), the compiler is very likely to just get rid of the entire loop. Hence, the code without that final aa runs a lot faster.

Here's a GCC example (adapted removing CUDA annotations), where you can see that even the lowest level of optimization (-O1) removes the entire body of the loop, leaving only comparisons and the incrementing of i. With -O2, the whole loop is removed.

edited Nov 7, 2020 at 3:22

answered Nov 7, 2020 at 3:10

Marco Bonelli

71.1k21 gold badges129 silver badges153 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Enrique Miguel Mora Meza Over a year ago

I never thought that compilers would do that, I thought they didn't care: v

Marco Bonelli Over a year ago

@EnriqueMiguelMoraMeza welcome to the magical world of compile-time optimization!

Enrique Miguel Mora Meza Over a year ago

Yes, i use it a lot, but honestly I never thought that they would get to delete an entire loop :v ¡Wow!

Collectives™ on Stack Overflow

How to optimize this OpenCL kernel?

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related