
Let's say I have a simple kernel that applies one of two different mathematical operations to each element of the input array. The operation to apply is selected based on the parity (modulo 2) of the element's position in the array:

__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;    

    if (i < N) {
        float x = A[i];

        if (i % 2 == 0)       x = cosf(sqrtf(x));
        else if (i % 2 == 1)  x = sqrtf(logf(x));

        A[i] = x;
    }    
}

int main()
{
    // some code...

    // Kernel invocation with N threads
    kernel<<<1, N>>>(A, N);

    // ...some more code
}

To improve execution time, is there perhaps a way to bypass the element-wise modulo computation by instead launching two kernels, one that processes even indices with x = cosf(sqrtf(x)) and one that processes odd indices with x = sqrtf(logf(x))?

  • Apologies, I fixed my question. Commented Dec 14, 2021 at 16:39
  • Why not have each thread do both calculations (and run half as many threads)? Commented Dec 14, 2021 at 17:00
  • Could you perhaps elaborate on why you think that would be a better way of doing it? Commented Dec 14, 2021 at 17:03
  • I've elaborated in my updated answer. Commented Dec 14, 2021 at 18:26

1 Answer


is there perhaps a way to bypass the element-wise modulo computation by instead launching two kernels, one that processes even indices with x = cosf(sqrtf(x)) and one that processes odd indices with x = sqrtf(logf(x))?

Yes.

To improve execution time,

I doubt that it would.

If you want an alternating (even/odd index) selection, a very low-cost operation is just:

if (index & 1)  
   //perform odd function
else
   //perform even function

I think it is very unlikely that breaking the operation into two kernels would help if you used that kind of low-cost selection.
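
For reference, here is what your original kernel looks like with that bitwise test substituted for the modulo (a minimal sketch; everything else is kept from your code as-is):

__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;

    if (i < N) {
        float x = A[i];

        // test the low bit instead of computing i % 2
        if (i & 1)  x = sqrtf(logf(x));   // odd index
        else        x = cosf(sqrtf(x));   // even index

        A[i] = x;
    }
}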

However, if you wanted to do so, it would be fairly simple:

__global__ void even_kernel(float* A, int N)
{
    int i = threadIdx.x*2;    // thread t handles even index 2t

    if (i < N) {
        float x = A[i];
        x = cosf(sqrtf(x));
        A[i] = x;
    }
}

__global__ void odd_kernel(float* A, int N)
{
    int i = threadIdx.x*2+1;  // thread t handles odd index 2t+1

    if (i < N) {
        float x = A[i];
        x = sqrtf(logf(x));
        A[i] = x;
    }
}

Based on the code you have provided, you could invoke these like so:

int main()
{
    // some code...
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Kernel invocations with N/2 threads each
    odd_kernel<<<1, N/2, 0, s1>>>(A, N);
    even_kernel<<<1, N/2, 0, s2>>>(A, N);

    // ...some more code
}

The stream handling isn't really necessary, but for such tiny grids (one block each, so N is at most 2048 here), there might be some small improvement in performance by running them "concurrently".
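
If it's useful, here is a fuller host-side sketch including the synchronization and cleanup that the fragment above omits (the value of N and the fill step are placeholders):

int main()
{
    const int N = 2048;                 // hypothetical size; each kernel uses N/2 threads
    float *A;
    cudaMalloc(&A, N * sizeof(float));
    // ... fill A ...

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    odd_kernel<<<1, N/2, 0, s1>>>(A, N);
    even_kernel<<<1, N/2, 0, s2>>>(A, N);

    cudaDeviceSynchronize();            // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(A);
}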

As suggested in the comments, we can probably address two of the performance criticisms that might be levelled against your original code by having each thread process two elements. If we use a vectorized load for this, then we can:

  • provide for coalesced loads/stores
  • avoid conditional behavior amongst threads

That kernel could look something like this:

__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;
    // reinterpret A as an array of float2, so each thread loads and stores
    // two adjacent elements with a single vectorized access
    float2* A2 = reinterpret_cast<float2 *>(A);

    if (i < N/2) {
        float2 x = A2[i];

        x.x = cosf(sqrtf(x.x));  // even index
        x.y = sqrtf(logf(x.y));  // odd index

        A2[i] = x;
    }
}

int main()
{
    // some code...

    // Kernel invocation with N/2 threads
    kernel<<<1, N/2>>>(A, N);

    // ...some more code
}

This has a few implicit assumptions:

  1. The A pointer is properly aligned for float2 access. This would be satisfied, for example, if A is returned (directly) by cudaMalloc.
  2. The total number of elements to process (N) is an even number (and 2048 or less, in this simplistic case); a sketch for handling odd N follows below.
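
If N could be odd, a possible (untested) way to relax the second assumption is to process the pairs as before and let a single thread handle the leftover last element:

__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;
    float2* A2 = reinterpret_cast<float2 *>(A);

    if (i < N/2) {
        float2 x = A2[i];
        x.x = cosf(sqrtf(x.x));
        x.y = sqrtf(logf(x.y));
        A2[i] = x;
    }

    // if N is odd, the last element (even index N-1) is left over;
    // let thread 0 handle it with the even-index function
    if ((N & 1) && (i == 0))
        A[N-1] = cosf(sqrtf(A[N-1]));
}

This would be launched with (N+1)/2 threads, e.g. kernel<<<1, (N+1)/2>>>(A, N);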

5 Comments

Thank you for your answer. I made a mistake in my original posting, the modulo operations were not correct. I have edited my question to correct this mistake.
Would it be possible to add the invocation code for these two kernels? Thank you!
Thank you! Great answer.
N is 2048 or less for the combined kernel?
Correct me if I am wrong, I think one advantage of the modification into two separate kernels might also be reduced branch divergence caused by an if statement checking the parity of the thread.
