is there perhaps a way to by-pass the element-wise modulo computation by instead launching two kernels, one that processes even indices with x = cosf(sqrtf(x)) and one that processes odd indices with x = sqrtf(logf(x))?
Yes, that is possible. But as to whether it would improve execution time, I doubt that it would.
If you want an alternating (even/odd index) selection, a very low-cost operation is just:
if (index & 1) {
    // perform odd function
} else {
    // perform even function
}
If you used that kind of low-cost selection, I think it is very unlikely that breaking the operation into two kernels would help.
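For reference, a single kernel using that low-cost selection could look something like this (a sketch; the kernel name is mine, `A` and `N` as in your code, single-block launch as in your original):

```cuda
__global__ void both_kernel(float* A, int N)
{
    int i = threadIdx.x;        // assumes a single-block launch, as in the original
    if (i < N) {
        float x = A[i];
        if (i & 1)
            x = sqrtf(logf(x)); // odd indices
        else
            x = cosf(sqrtf(x)); // even indices
        A[i] = x;
    }
}
```

The branch here diverges within a warp, which is one of the criticisms addressed further below.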
However if you wanted to do so, it would be fairly simple:
__global__ void even_kernel(float* A, int N)
{
    int i = threadIdx.x * 2;      // even indices: 0, 2, 4, ...
    if (i < N) {
        float x = A[i];
        x = cosf(sqrtf(x));
        A[i] = x;
    }
}

__global__ void odd_kernel(float* A, int N)
{
    int i = threadIdx.x * 2 + 1;  // odd indices: 1, 3, 5, ...
    if (i < N) {
        float x = A[i];
        x = sqrtf(logf(x));
        A[i] = x;
    }
}
Based on the code you have provided, you could invoke these like so:
int main()
{
    // some code...
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Kernel invocation with N/2 threads each
    odd_kernel<<<1, N/2, 0, s1>>>(A, N);
    even_kernel<<<1, N/2, 0, s2>>>(A, N);
    // ...some more code
}
The stream handling isn't really necessary, but for such tiny grids (one block each, N is at most 1024), there might be some small improvement in performance by running them "concurrently".
As suggested in the comments, we can probably address two of the performance criticisms that might be levelled against your original code by having each thread process two elements. If we use a vectorized load for this, then we can:
- provide for coalesced loads/stores
- avoid conditional behavior amongst threads
That kernel could look something like this:
__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;
    float2* A2 = reinterpret_cast<float2 *>(A);  // each thread loads/stores a pair
    if (i < N/2) {
        float2 x = A2[i];
        x.x = cosf(sqrtf(x.x));  // even element of the pair
        x.y = sqrtf(logf(x.y));  // odd element of the pair
        A2[i] = x;
    }
}
int main()
{
    // some code...
    // Kernel invocation with N/2 threads
    kernel<<<1, N/2>>>(A, N);
    // ...some more code
}
This has a few implicit assumptions:
- The A pointer is properly aligned for float2 access. This would be satisfied, for example, if A is returned (directly) by cudaMalloc.
- The total number of elements to process (N) is an even number (and 2048 or less, in this simplistic case.)
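If you needed to lift that size limitation, the usual pattern is a multi-block launch with a globally computed index (a sketch; the block size of 256 and the ceiling-division launch arithmetic are my choices, not from your code):

```cuda
__global__ void kernel(float* A, int N)
{
    // global thread index across all blocks, not just threadIdx.x
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2* A2 = reinterpret_cast<float2 *>(A);
    if (i < N/2) {
        float2 x = A2[i];
        x.x = cosf(sqrtf(x.x));
        x.y = sqrtf(logf(x.y));
        A2[i] = x;
    }
}

// host side: round the grid size up so N/2 threads are covered
// int threads = 256;
// int blocks  = (N/2 + threads - 1) / threads;
// kernel<<<blocks, threads>>>(A, N);
```

The same alignment and even-N assumptions from the list above still apply.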