I'm having trouble understanding how I could use CUDA shared memory here, since I only use the thread index to decide which range of pixels a given thread should process.

__global__ void gpu_histogram_equalization(unsigned char * img_out, unsigned char * img_in,
                            int * hist_in, int img_size, int nbr_bin, int numOfThreads, int * lut){


    int i = 0;
    int x = threadIdx.x + blockDim.x*blockIdx.x;

    int start;
    int end;

    /* Get the result image */
    if(x >= img_size) {
       return;
    }
    start = ((img_size/numOfThreads) * x);
    if(numOfThreads == 1) {
       end = (img_size/numOfThreads);
    }
    else {
       end = ((img_size/numOfThreads) * (x+1));
    }
    for(i = start; i < end; i ++){
        if(lut[img_in[i]] > 255){
            img_out[i] = 255;
        }
        else{
            img_out[i] = (unsigned char)lut[img_in[i]];
        }

    }
}

Can anyone confirm my suspicion that shared memory cannot be put to good use in this kernel?

  • Shared memory will not lead to a performance increase for this example. What you could do is coalesce the memory accesses better by removing the loop and letting each thread execute img_out[x] = min(255, lut[img_in[x]]). You could remove all code except the x variable, the boundary check (x >= img_size), and the calculation inside the loop. Commented Dec 9, 2015 at 10:18
  • The reason for no change in performance is that I am using one thread to perform multiple calculations instead of just one, right? Also, could you explain the second part of your answer a bit more? Thank you. Commented Dec 9, 2015 at 10:25
  • I have posted an answer with a code example for you. Commented Dec 9, 2015 at 10:41

1 Answer

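Shared memory gives you a performance increase only if you reuse the same data multiple times. In this kernel each input pixel is read once and each output pixel is written once, so shared memory does not help. Instead, the code can be rewritten so that each thread handles a single pixel, which lets neighbouring threads access neighbouring addresses (coalesced access) and makes better use of memory bandwidth.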

Something like this:

__global__ void gpu_histogram_equalization(unsigned char * img_out, unsigned char * img_in,
                        int * hist_in, int img_size, int nbr_bin, int numOfThreads, int * lut){
  int lutval;
  /* One thread per pixel: x is the global pixel index */
  int x = threadIdx.x + blockDim.x*blockIdx.x;

  /* Threads past the end of the image have nothing to do */
  if(x >= img_size) {
     return;
  }

  /* Read the LUT once and keep the value in a register */
  lutval = lut[img_in[x]];

  /* Get the result image, clamped to 255 */
  if(lutval > 255){
    img_out[x] = 255;
  }
  else{
    img_out[x] = (unsigned char)lutval;
  }
}
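
With one thread per pixel, the launch has to supply at least img_size threads. The following is a minimal host-side sketch of how that could look, not the asker's actual host code: the names apply_lut, blocks_for and equalize_on_gpu and the block size of 256 are assumptions made for illustration. The kernel is a trimmed copy of the one above, with the unused parameters dropped.

```cuda
#include <cuda_runtime.h>

// Trimmed one-thread-per-pixel kernel (hypothetical name apply_lut).
__global__ void apply_lut(unsigned char *img_out, const unsigned char *img_in,
                          int img_size, const int *lut) {
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    if (x >= img_size) return;            // the last block may be partly idle
    int lutval = lut[img_in[x]];
    img_out[x] = (unsigned char)(lutval > 255 ? 255 : lutval);
}

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
static int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical host wrapper: copies the image and a 256-entry LUT to the
// device, runs the kernel, and copies the result back into h_out.
static cudaError_t equalize_on_gpu(unsigned char *h_out, const unsigned char *h_in,
                                   int img_size, const int *h_lut) {
    const int threads = 256;              // assumed block size, tune as needed
    unsigned char *d_in = NULL, *d_out = NULL;
    int *d_lut = NULL;
    cudaError_t err;

    if ((err = cudaMalloc((void **)&d_in, img_size)) != cudaSuccess) goto done;
    if ((err = cudaMalloc((void **)&d_out, img_size)) != cudaSuccess) goto done;
    if ((err = cudaMalloc((void **)&d_lut, 256 * sizeof(int))) != cudaSuccess) goto done;

    err = cudaMemcpy(d_in, h_in, img_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) goto done;
    err = cudaMemcpy(d_lut, h_lut, 256 * sizeof(int), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) goto done;

    apply_lut<<<blocks_for(img_size, threads), threads>>>(d_out, d_in, img_size, d_lut);
    if ((err = cudaGetLastError()) != cudaSuccess) goto done;   // launch errors

    // This blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(h_out, d_out, img_size, cudaMemcpyDeviceToHost);

done:
    cudaFree(d_in);                       // cudaFree(NULL) is a no-op
    cudaFree(d_out);
    cudaFree(d_lut);
    return err;
}
```

Rounding the block count up means the last block can contain threads with x >= img_size, which is why the boundary check has to stay in the kernel.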