Here is my understanding of the execution pattern of CUDA threads: a thread executes the body of a kernel only if it meets a condition, and the indexing and memory accesses of each thread are usually derived from its thread and block IDs. But when I came across the following piece of code, I stumbled. As for correctness, this code gives a perfectly correct result.

__global__ void kernel0(int *a)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    __shared__ int shared_a[32][33];

    for (int g5 = 0; g5 <= 96; g5 += 32) {
      // Stage a tile of a[] into shared memory; the c1 loop starts at t0,
      // so each thread loads its own set of columns.
      for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
        for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
          shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)];

      __syncthreads();

      // Each thread updates one row of the tile sequentially.
      if (32 * b0 + t0 <= 99)
        for (int c2 = 0; c2 <= min(31, -g5 + 98); c2 += 1)
          shared_a[t0][c2 + 1] = (shared_a[t0][c2] + 5);

      __syncthreads();

      // Write the tile back; each thread stores one column.
      if (((t0 + 31) % 32) + g5 <= 98)
        for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
          a[(32 * b0 + c0) * 100 + (((t0 + 31) % 32) + g5 + 1)] = shared_a[c0][((t0 + 31) % 32) + 1];

      __syncthreads();
    }
}

My question is: which thread IDs inside a block of 32 threads execute the first three for loops?

1 Answer

Short answer

Every thread will execute the for loops, but only the iterations with c0 in the interval [0, min(31, -32 * b0 + 99)] and c1 in [t0, min(32, -g5 + 99)] (with c1 stepping by 32) do any work at the inner statement, namely

shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)]
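
To see this concretely, here is a minimal, self-contained sketch (the kernel show_mapping, the launch configuration, and the printf output are mine, not from the question): each thread of a 32-thread block prints the (c0, c1) positions it would write for one g5 value.

#include <cstdio>

__global__ void show_mapping(int g5)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;

    // Same loop bounds as the question's first loop nest.
    for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
        for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
            printf("block %d, thread %2d -> shared_a[%2d][%2d]\n", b0, t0, c0, c1);
}

int main()
{
    show_mapping<<<4, 32>>>(96);   // last tile: only columns 96..99 remain
    cudaDeviceSynchronize();
    return 0;
}

With g5 = 96 the column bound is min(32, 3) = 3, so only threads 0 to 3 print anything; threads 4 to 31 still execute the loops, but their first c1 value already exceeds the bound.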

About the mapping mechanism

The way you assign each thread its corresponding work is through indexing. For example, the following statement will be executed only by thread 0 of each block:

if (threadIdx.x == 0) {
    // some code
}

While this one will be executed only by the thread with global index 0 in a one-dimensional grid:

if (threadIdx.x + blockIdx.x * blockDim.x == 0) {
    // some code
}
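
The same guard generalizes to any data size: compute a global index and let the if mask out the threads that fall past the end of the array. A minimal sketch (the kernel add_one and the names data and n are illustrative, not from the answer):

__global__ void add_one(int *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)          // threads whose index is past the array do nothing
        data[i] += 1;
}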

This code (from a simple array reduction) is also useful to illustrate this behavior:

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;              // tid is threadIdx.x

    if (index < blockDim.x) {             // masks out threads with no work this step
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}

All threads in a block execute the for loop, and each of them computes its own value for the index variable. The if statement then prevents some threads from executing the addition; at each step, only the threads whose index lands inside the block perform it.
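
Since the snippet assumes tid and sdata are already defined, here is a hedged, self-contained version (the surrounding kernel and the names g_in and g_out are my additions; the loop itself is unchanged):

__global__ void reduce(const int *g_in, int *g_out)
{
    extern __shared__ int sdata[];            // one int per thread, sized at launch
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = g_in[i];                     // each thread loads one element
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;

        if (index < blockDim.x) {
            sdata[index] += sdata[index + s]; // only the masked-in threads add
        }
        __syncthreads();
    }

    if (tid == 0)                             // thread 0 holds the block's partial sum
        g_out[blockIdx.x] = sdata[0];
}

It would be launched as reduce<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(in, out), assuming threadsPerBlock is a power of two and the input length is a multiple of it.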

As you can see, this leaves some threads idle while others may have a lot of work to do (load imbalance), so a homogeneous workload across the grid is desirable to maximize performance.

Learning material

This can be somewhat confusing at first, so I encourage you to read the CUDA C Programming Guide included in the CUDA toolkit. Play around with the matrix-matrix multiplication, vector addition, and vector reduction examples.

A very comprehensive guide is the book "Programming Massively Parallel Processors" by David B. Kirk and Wen-mei W. Hwu.


Comments

Thanks for the reply, but I have a doubt. If all the threads execute the for loop with block ID b0 = 0, will this not give rise to a race condition?
I'm not completely sure what you mean, but blockIdx starts at 0 in every kernel. That is how you index the very first threads when using threadIdx.x + blockIdx.x * blockDim.x, for example.
Say there are 32 threads inside the block with blockIdx = 0; all 32 of them will simultaneously run the part shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)]. Shared memory is allocated per block, so inside a block all 32 threads will concurrently access and update shared_a. Would this not give rise to a race condition?
No (sorry, you edited your question). The only way to keep some threads from taking part in a section is to use guards like if (index < something). The inner part of the loops will only be executed by the threads whose index numbers match the [c0][c1] bounds...
c1 varies with the thread ID (it starts at t0 and steps by 32), so the threads are never updating the same location in shared memory. No race condition.