Here is my understanding of the execution pattern of CUDA threads: a thread executes the kernel body only if it satisfies some condition, and each thread's indexing and memory accesses are usually derived from its thread and block IDs. But when I came across the following piece of code, I stumbled. As far as correctness goes, this code produces a perfectly correct result.
__global__ void kernel0(int *a)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    __shared__ int shared_a[32][33];   // shared tile: 32 rows, 33 columns

    // a is addressed as a 100x100 array; g5 walks one row-block in column tiles of 32
    for (int g5 = 0; g5 <= 96; g5 += 32) {
        // load a tile of a[] into shared memory
        for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
            for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
                shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)];
        __syncthreads();

        // update the tile in shared memory (row-wise recurrence)
        if (32 * b0 + t0 <= 99)
            for (int c2 = 0; c2 <= min(31, -g5 + 98); c2 += 1)
                shared_a[t0][c2 + 1] = (shared_a[t0][c2] + 5);
        __syncthreads();

        // write the updated tile back to global memory
        if (((t0 + 31) % 32) + g5 <= 98)
            for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
                a[(32 * b0 + c0) * 100 + (((t0 + 31) % 32) + g5 + 1)] = shared_a[c0][((t0 + 31) % 32) + 1];
        __syncthreads();
    }
}
My question is: which thread IDs inside a block of 32 threads execute the first three for-loops?
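For reference, the "usual" pattern I described at the top looks something like this (a minimal sketch for illustration only; kernel_simple, n, and the +5 update are made up and are not part of the kernel above):

__global__ void kernel_simple(int *a, int n)
{
    // each thread derives exactly one global index from its block and thread IDs
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // threads whose index falls outside the array do nothing
    if (idx < n)
        a[idx] = a[idx] + 5;
}

In that style it is obvious which thread touches which element, whereas in kernel0 the loops over c0 and c1 are what confuse me.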