is there perhaps a way to by-pass the element-wise modulo computation by instead launching two kernels, one that processes even indices with x = cosf(sqrtf(x)) and one that processes odd indices with x = sqrtf(logf(x))?
Yes, that is possible. But as to whether it would improve execution time, I doubt that it would.
If you want an alternating (even/odd index) selection, a very low-cost operation is just:
if (index & 1) {
    // perform odd function
} else {
    // perform even function
}
If you used that kind of low-cost selection, I think it is very unlikely that breaking the operation into two kernels would help.
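For reference, a single kernel using that low-cost selection could look something like this (a sketch; the kernel name is mine, `A` and `N` as in your code, single-block launch as in your original):

```cuda
__global__ void both_kernel(float* A, int N)
{
    int i = threadIdx.x;        // assumes a single-block launch, as in the original
    if (i < N) {
        float x = A[i];
        if (i & 1)
            x = sqrtf(logf(x)); // odd indices
        else
            x = cosf(sqrtf(x)); // even indices
        A[i] = x;
    }
}
```

The branch here diverges within a warp, which is one of the criticisms addressed further below.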
However if you wanted to do so, it would be fairly simple:
__global__ void even_kernel(float* A, int N)
{
    int i = threadIdx.x * 2;      // even indices: 0, 2, 4, ...
    if (i < N) {
        float x = A[i];
        x = cosf(sqrtf(x));
        A[i] = x;
    }
}

__global__ void odd_kernel(float* A, int N)
{
    int i = threadIdx.x * 2 + 1;  // odd indices: 1, 3, 5, ...
    if (i < N) {
        float x = A[i];
        x = sqrtf(logf(x));
        A[i] = x;
    }
}
Based on the code you have provided, you could invoke these like so:
int main()
{
    // some code...
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Kernel invocation with N/2 threads each
    odd_kernel<<<1, N/2, 0, s1>>>(A, N);
    even_kernel<<<1, N/2, 0, s2>>>(A, N);
    // ...some more code
}
The stream handling isn't really necessary, but for such tiny grids (one block each, N is at most 1024), there might be some small improvement in performance by running them "concurrently".
As suggested in the comments, we can probably address two of the performance criticisms that might be levelled against your original code by having each thread process two elements. If we use a vectorized load for this, then we can:
- provide for coalesced loads/stores
- avoid conditional behavior amongst threads
That kernel could look something like this:
__global__ void kernel(float* A, int N)
{
    int i = threadIdx.x;
    float2* A2 = reinterpret_cast<float2 *>(A);  // each thread loads/stores a pair
    if (i < N/2) {
        float2 x = A2[i];
        x.x = cosf(sqrtf(x.x));  // even element of the pair
        x.y = sqrtf(logf(x.y));  // odd element of the pair
        A2[i] = x;
    }
}
int main()
{
    // some code...
    // Kernel invocation with N/2 threads
    kernel<<<1, N/2>>>(A, N);
    // ...some more code
}
This has a few implicit assumptions:
- The A pointer is properly aligned for float2 access. This would be satisfied, for example, if A is returned (directly) by cudaMalloc.
- The total number of elements to process (N) is an even number (and 2048 or less, in this simplistic case.)
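If you needed to lift that size limitation, the usual pattern is a multi-block launch with a globally computed index (a sketch; the block size of 256 and the ceiling-division launch arithmetic are my choices, not from your code):

```cuda
__global__ void kernel(float* A, int N)
{
    // global thread index across all blocks, not just threadIdx.x
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2* A2 = reinterpret_cast<float2 *>(A);
    if (i < N/2) {
        float2 x = A2[i];
        x.x = cosf(sqrtf(x.x));
        x.y = sqrtf(logf(x.y));
        A2[i] = x;
    }
}

// host side: round the grid size up so N/2 threads are covered
// int threads = 256;
// int blocks  = (N/2 + threads - 1) / threads;
// kernel<<<blocks, threads>>>(A, N);
```

The same alignment and even-N assumptions from the list above still apply.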