Here is my understanding of the execution pattern of CUDA threads: a thread executes the kernel body only if it satisfies some condition, and each thread's indexing and memory accesses are usually derived from its thread and block IDs. But when I came across the following piece of code, I stumbled. As far as correctness goes, this code produces a perfectly correct result.
__global__ void kernel0(int *a)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    __shared__ int shared_a[32][33];   // shared tile: 32 rows, 33 columns

    // a is addressed as a 100x100 array; g5 walks one row-block in column tiles of 32
    for (int g5 = 0; g5 <= 96; g5 += 32) {
        // load a tile of a[] into shared memory
        for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
            for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
                shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)];
        __syncthreads();

        // update the tile in shared memory (row-wise recurrence)
        if (32 * b0 + t0 <= 99)
            for (int c2 = 0; c2 <= min(31, -g5 + 98); c2 += 1)
                shared_a[t0][c2 + 1] = (shared_a[t0][c2] + 5);
        __syncthreads();

        // write the updated tile back to global memory
        if (((t0 + 31) % 32) + g5 <= 98)
            for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
                a[(32 * b0 + c0) * 100 + (((t0 + 31) % 32) + g5 + 1)] = shared_a[c0][((t0 + 31) % 32) + 1];
        __syncthreads();
    }
}
My question is: which thread IDs inside a block of 32 threads execute the first three for-loops?
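For reference, the "usual" pattern I described at the top looks something like this (a minimal sketch for illustration only; kernel_simple, n, and the +5 update are made up and are not part of the kernel above):

__global__ void kernel_simple(int *a, int n)
{
    // each thread derives exactly one global index from its block and thread IDs
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // threads whose index falls outside the array do nothing
    if (idx < n)
        a[idx] = a[idx] + 5;
}

In that style it is obvious which thread touches which element, whereas in kernel0 the loops over c0 and c1 are what confuse me.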