CUDA Shared Memory Possibility

Question

I'm having a bit of an issue understanding how I could make use of CUDA shared memory, since I'm not using the thread index for anything other than deciding which calculations a given thread should do.

__global__ void gpu_histogram_equalization(unsigned char * img_out, unsigned char * img_in,
                            int * hist_in, int img_size, int nbr_bin, int numOfThreads, int * lut){


    int i = 0;
    int x = threadIdx.x + blockDim.x*blockIdx.x;

    int start;
    int end;

    /* Get the result image */
    if(x >= img_size) {
       return;
    }
    start = ((img_size/numOfThreads) * x);
    if(numOfThreads == 1) {
       end = (img_size/numOfThreads);
    }
    else {
       end = ((img_size/numOfThreads) * (x+1));
    }
    for(i = start; i < end; i ++){
        if(lut[img_in[i]] > 255){
            img_out[i] = 255;
        }
        else{
            img_out[i] = (unsigned char)lut[img_in[i]];
        }

    }
}

Can anyone confirm my suspicion that shared memory cannot be put to use here?

Shared memory will not lead to a performance increase for this example. What you could do instead is coalesce the memory accesses better by removing the loop and letting each thread execute img_out[x] = min(255, lut[img_in[x]]). You could remove all the code except the x variable, the boundary check (x >= img_size), and the calculation inside the loop.
– brano, Commented Dec 9, 2015 at 10:18
So the reason for no change in performance is that I am using each thread to perform multiple calculations instead of just one, right? Also, could you explain the second part of your answer a bit more? Thank you.
– QQCuda, Commented Dec 9, 2015 at 10:25

brano · Accepted Answer · 2015-12-09 10:39:29Z

1

Using shared memory will give you a performance increase if you reuse the same data multiple times. Here each pixel is read and written only once, so the kernel has no such reuse to exploit. The code can instead be rewritten to make better use of memory bandwidth, without any shared memory.

Something like this:

__global__ void gpu_histogram_equalization(unsigned char * img_out, unsigned char * img_in,
                        int * hist_in, int img_size, int nbr_bin, int numOfThreads, int * lut){
  int lutval;
  /* One thread per pixel: thread x handles pixel x, with no loop */
  int x = threadIdx.x + blockDim.x*blockIdx.x;

  /* Get the result image */
  if(x >= img_size) {
     return;
  }

  lutval = lut[img_in[x]];

  /* Clamp to the 8-bit range before storing */
  if(lutval > 255){
    img_out[x] = 255;
  }
  else{
    img_out[x] = (unsigned char)lutval;
  }
}
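A kernel written this way needs at least one thread per pixel, so the launch configuration has to cover img_size threads rather than a fixed numOfThreads. The following is a minimal sketch of how that could look on the host side, not the asker's actual host code. The names apply_lut, div_up and equalize_on_gpu are hypothetical, the block size of 256 is an assumed typical choice, and the lookup table is assumed to have 256 entries (one per possible 8-bit pixel value) with img_size > 0.

```cuda
#include <cuda_runtime.h>

// Ceiling division: smallest number of blocks of `threads` threads covering n items.
__host__ __device__ inline int div_up(int n, int threads) {
    return (n + threads - 1) / threads;
}

// One thread per pixel: thread x reads img_in[x] and writes img_out[x], so
// neighbouring threads in a warp touch neighbouring bytes (coalesced access).
__global__ void apply_lut(unsigned char *img_out, const unsigned char *img_in,
                          const int *lut, int img_size) {
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    if (x >= img_size) {
        return;  // the last block may contain threads past the end of the image
    }
    int lutval = lut[img_in[x]];
    img_out[x] = (unsigned char)(lutval > 255 ? 255 : lutval);
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and copies the result back. Assumes h_lut holds 256 entries and
// img_size > 0. Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t equalize_on_gpu(unsigned char *h_out, const unsigned char *h_in,
                            const int *h_lut, int img_size) {
    const int threads_per_block = 256;  // assumed block size, not tuned
    const size_t lut_bytes = 256 * sizeof(int);
    unsigned char *d_in = nullptr, *d_out = nullptr;
    int *d_lut = nullptr;

    cudaError_t err = cudaMalloc(&d_in, img_size);
    if (err == cudaSuccess) err = cudaMalloc(&d_out, img_size);
    if (err == cudaSuccess) err = cudaMalloc(&d_lut, lut_bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in, img_size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_lut, h_lut, lut_bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Enough blocks that every pixel gets its own thread.
        int blocks = div_up(img_size, threads_per_block);
        apply_lut<<<blocks, threads_per_block>>>(d_out, d_in, d_lut, img_size);
        err = cudaGetLastError();  // reports launch-configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, img_size, cudaMemcpyDeviceToHost);

    // cudaFree accepts null pointers, so this is safe after a partial failure.
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_lut);
    return err;
}
```

With this layout the number of blocks follows from the image size, so a 1000-pixel image with 256 threads per block launches 4 blocks and the boundary check discards the 24 surplus threads.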

