4

In CUDA 5.0, NVIDIA added a "texture object" (cudaTextureObject_t) that makes textures a bit easier to work with. Previously, it was necessary to define textures as global variables.


I followed this NVIDIA example on using the cudaTextureObject_t. It works properly for the 1D case. I tried to extend the example to work on 2D pitched memory:

#define WIDTH 6
#define HEIGHT 2
int width = WIDTH; int height = HEIGHT;
float h_buffer[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
float* d_buffer;
size_t pitch;
cudaMallocPitch(&d_buffer, &pitch, sizeof(float)*width, height);
cudaMemcpy2D(d_buffer, pitch, &h_buffer, sizeof(float)*width, sizeof(float)*width, height, cudaMemcpyHostToDevice);
printf("pitch = %d \n", pitch);

//CUDA 5 texture objects: https://developer.nvidia.com/content/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = d_buffer;
resDesc.res.pitch2D.pitchInBytes =  pitch;
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.desc.f = cudaChannelFormatKindFloat;
resDesc.res.pitch2D.desc.x = 32; // bits per channel 
resDesc.res.pitch2D.desc.y = 32; 
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

To see if the data is indeed accessible through the texture cache, I printed a few bytes in this kernel:

__global__ void printGpu_tex(cudaTextureObject_t tex) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = tex2D<float>(tex, tidy, tidx);
        printf("tex2D<float>(tex, %d, %d) = %f \n", tidy, tidx, x);
    }
}

I expected the output of this to be "1,2,3,...,12." But, it prints "1,7,7,7,...3,9,...":

tex2D<float>(tex, 0, 0) = 1.000000 
tex2D<float>(tex, 0, 1) = 7.000000 
tex2D<float>(tex, 0, 2) = 7.000000 
tex2D<float>(tex, 0, 3) = 7.000000 
tex2D<float>(tex, 0, 4) = 7.000000 
tex2D<float>(tex, 0, 5) = 7.000000 
tex2D<float>(tex, 1, 0) = 3.000000 
tex2D<float>(tex, 1, 1) = 9.000000 
tex2D<float>(tex, 1, 2) = 9.000000 
tex2D<float>(tex, 1, 3) = 9.000000 
tex2D<float>(tex, 1, 4) = 9.000000 
tex2D<float>(tex, 1, 5) = 9.000000 

To verify that the d_buffer data is set up correctly, I also made a "print kernel" for the raw d_buffer array without using the texture cache:

__global__ void printGpu_vanilla(float* d_buffer, int pitch) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = d_buffer[tidy*pitch + tidx];
        printf("d_buffer[%d][%d] = %f \n", tidy, tidx, x);
    }
}

output looks good (unlike the texture cache version):

d_buffer[0][0] = 1.000000 
d_buffer[0][2] = 2.000000 
d_buffer[0][3] = 3.000000 
d_buffer[0][4] = 4.000000 
d_buffer[0][5] = 5.000000 
d_buffer[0][5] = 6.000000 
d_buffer[1][0] = 7.000000 
d_buffer[1][6] = 8.000000 
d_buffer[1][7] = 9.000000 
d_buffer[1][8] = 10.000000 
d_buffer[1][9] = 11.000000 
d_buffer[1][5] = 12.000000 

Any ideas on what might be going wrong with the texture cache version?


Downloads:

1
  • My guess is that part of the problem lies in cudaMallocPitch vs cudaMallocArray. In the old texture cache API, cudaMallocArray was the typical thing to use. But, cudaMallocArray expects a cudaChannelFormatDesc, which appears to be obsolete in the new cudaTextureObject_t interface. Commented May 5, 2013 at 3:08

2 Answers 2

4

Your cudaChannelFormatDesc in resDesc.res.pitch2D.desc is wrong: y should be 0.

To set the FormatDesc right use CreateChannelDesc<>() functions like resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>(); instead of setting it manually.

resDesc.res.pitch2D.desc.y = 32 would be valid for a float2 texture.

Sign up to request clarification or add additional context in comments.

Comments

0

Except cudaChannelFormatDesc, seems you have one logical problem in your code which is not a big deal, but can be very misleading if you are not cautious. If you want to follow the CUDA thread organization into blocks and grids and the way wraps are scheduled (moreover, if you want your code be consistence with C++ concept of being "row major"), better to consider x as the fastest varying dimension (similar to row major). Since your code shows that y is varying faster that x, more proper way would be switching the indexes in your code:

float x = tex2D<float>(tex, tidx, tidy);
printf("tex2D<float>(tex, %d, %d) = %f \n", tidx, tidy, x);
...
printf("d_buffer[%d][%d] = %f \n", tidx, tidy, x);

Worth to mention once again, it's not a big problem, but meanwhile can be very confusing, specially when you want to integrate this kernel with other parts of your code.

1 Comment

It's a bad manner to DWON VOTE without any reason!! At least you can leave a comment to indicate the error. In this way people will get to know why something is wrong (Knowing something is wrong is part of a science).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.