
I have the following problem: I am trying to divide a shared array into smaller arrays and then use these arrays in other device functions. In my kernel function I do:

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {
    //set up shared memory block
    extern __shared__ vec3f share[];
    vec3f *sh_pos = share;
    vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
    vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
    vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
    //index by 2d threadidx's
    unsigned int index = (block_x * blockDim.x + threadIdx.x) + blockDim.x * gridDim.x * (block_y * blockDim.y + threadIdx.y);
    sh_pos[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].position();
    sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].velocity();
    sh_pressure[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].pressure();
    sh_density[blockDim.x * threadIdx.x + threadIdx.y].x = oldParticles[index].density();
    __syncthreads();
    d_force_pressure(oldParticles[arr_pos],c_kernel_support);
    __syncthreads();
  }
}

As far as I can tell, all the sh_ arrays get filled with zeros rather than the values I want, and I can't tell what I am doing wrong. Note that vec3f is a vector of floats, just like the float3 datatype. Also, I didn't think I could mix floats in for density and pressure, so I just made them vectors and am using a single component. Then, for example, my d_force_pressure function is:

__device__ void d_force_pressure(particle& d_particle, float h) {
  extern __shared__ vec3f share[];
  vec3f *sh_pos = share;
  vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
  vec3f *sh_density = &sh_velocity[blockDim.x*blockDim.y];
  vec3f *sh_pressure = &sh_density[blockDim.x*blockDim.y];
  for (int i = 0; i < blockDim.x * blockDim.y; i++) {
    vec3f diffPos = d_particle.position() - sh_pos[i];
    d_particle.force() += GradFuncion(diffPos,h) * -1.0 * c_particle_mass *  (d_particle.pressure()+sh_pressure[i].x)/(2.0*sh_density[i].x);
  }  
}

After calls to this function I get NaNs, since I am dividing by zero (sh_density[i].x is, as far as I can tell, 0). Also, is this in general the correct way to load shared memory?

The kernel is called by:

dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);
int sharedMemSize = block.x*block.y*4*sizeof(vec3f);
force_kernel<<< grid,block,sharedMemSize  >>>(particle_ptrs[1],particle_ptrs[0],time_step);
  • How are you calling the kernel? Commented May 25, 2013 at 10:17
  • @talonmies I made an edit showing my kernel call. Commented May 25, 2013 at 10:23
  • Are you sure that there are no CUDA runtime errors occurring when you try running your code? Do you have comprehensive API and kernel launch error checking as per this? Commented May 25, 2013 at 14:33

2 Answers


This is kind of a follow-up answer.

As per the comments made by @RobertCrovella, I went on to run cuda-memcheck. Believe it or not, this actually showed no errors. However, when I changed a constant in my code (one that controls the sizing of some arrays), cuda-memcheck showed errors related to the problem posted in this question (a write error). This made me re-check the way I was filling the shared arrays. Basically, what needed changing was the

for (int block_x = 0; block_x < blockDim.x; block_x++) {
  for (int block_y = 0; block_y < blockDim.y; block_y++) {

to

for (int block_x = 0; block_x < gridDim.x; block_x++) {
  for (int block_y = 0; block_y < gridDim.y; block_y++) {

I believe that this then gives the right position for the index variable. I basically learned that whenever you are using shared memory and notice things running slowly, it's a good idea to run cuda-memcheck.

0

I indicated in your previous question that you don't want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), (int)ceil(sqrt(float(max_particles)) / (float(block.x*block.y))), 1);

you want to do this:

dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x))), (int)ceil(sqrt(float(max_particles)) / (float(block.y))), 1);

The x grid dimension should be scaled by the threadblock x dimension, not by the threadblock x dimension times the threadblock y dimension. However, the code I posted in my previous answer also had this error; even though I pointed it out in the comments, I forgot to fix it.

Furthermore, this indexing doesn't look right to me:

sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] 

I think it should be:

sh_velocity[blockDim.x * threadIdx.y + threadIdx.x] 

You have several examples of that.

You haven't posted a complete executable. There certainly may be more issues than the ones I've pointed out above. If I have to go through all the vec3f -> float3 conversion work I did in your last question, well, someone else can help you then. If you write a simple reproducer that doesn't depend on a bunch of code I don't have, I can try to help further. More than likely, if you do that, you'll discover the problem yourself.

Have you put cuda error checking into your code like I suggested in my last answer?

You might also want to run your code through cuda-memcheck:

cuda-memcheck ./mycode

1 Comment

I think I figured it out. It primarily had to do with the grid setup (I used the fix that you posted above). I also made the changes you suggested for indexing the sh_ arrays. I feel like the two indexing methods are equivalent for a square block (which is what I am using) but differ if the block is rectangular. I did implement error checking. It seems like things are working right now; I'll re-post if I run into any more issues. The simulation runs slowly, but I think that's because of my GPU, which is a roughly six-year-old 8600M GT, and the plethora of N^2 calculations.
