New CUDA Texture Object -- getting wrong data in 2D case

Question

In CUDA 5.0, NVIDIA added a "texture object" (cudaTextureObject_t) that makes textures a bit easier to work with. Previously, it was necessary to define textures as global variables.

I followed this NVIDIA example on using the cudaTextureObject_t. It works properly for the 1D case. I tried to extend the example to work on 2D pitched memory:

#define WIDTH 6
#define HEIGHT 2
int width = WIDTH; int height = HEIGHT;
float h_buffer[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
float* d_buffer;
size_t pitch;
cudaMallocPitch(&d_buffer, &pitch, sizeof(float)*width, height);
cudaMemcpy2D(d_buffer, pitch, &h_buffer, sizeof(float)*width, sizeof(float)*width, height, cudaMemcpyHostToDevice);
printf("pitch = %d \n", pitch);

//CUDA 5 texture objects: https://developer.nvidia.com/content/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = d_buffer;
resDesc.res.pitch2D.pitchInBytes =  pitch;
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.desc.f = cudaChannelFormatKindFloat;
resDesc.res.pitch2D.desc.x = 32; // bits per channel 
resDesc.res.pitch2D.desc.y = 32; 
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

To see if the data is indeed accessible through the texture cache, I printed a few bytes in this kernel:

__global__ void printGpu_tex(cudaTextureObject_t tex) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = tex2D<float>(tex, tidy, tidx);
        printf("tex2D<float>(tex, %d, %d) = %f \n", tidy, tidx, x);
    }
}

I expected the output of this to be "1,2,3,...,12." But, it prints "1,7,7,7,...3,9,...":

tex2D<float>(tex, 0, 0) = 1.000000 
tex2D<float>(tex, 0, 1) = 7.000000 
tex2D<float>(tex, 0, 2) = 7.000000 
tex2D<float>(tex, 0, 3) = 7.000000 
tex2D<float>(tex, 0, 4) = 7.000000 
tex2D<float>(tex, 0, 5) = 7.000000 
tex2D<float>(tex, 1, 0) = 3.000000 
tex2D<float>(tex, 1, 1) = 9.000000 
tex2D<float>(tex, 1, 2) = 9.000000 
tex2D<float>(tex, 1, 3) = 9.000000 
tex2D<float>(tex, 1, 4) = 9.000000 
tex2D<float>(tex, 1, 5) = 9.000000

To verify that the d_buffer data is set up correctly, I also made a "print kernel" for the raw d_buffer array without using the texture cache:

__global__ void printGpu_vanilla(float* d_buffer, int pitch) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = d_buffer[tidy*pitch + tidx];
        printf("d_buffer[%d][%d] = %f \n", tidy, tidx, x);
    }
}

output looks good (unlike the texture cache version):

d_buffer[0][0] = 1.000000 
d_buffer[0][2] = 2.000000 
d_buffer[0][3] = 3.000000 
d_buffer[0][4] = 4.000000 
d_buffer[0][5] = 5.000000 
d_buffer[0][5] = 6.000000 
d_buffer[1][0] = 7.000000 
d_buffer[1][6] = 8.000000 
d_buffer[1][7] = 9.000000 
d_buffer[1][8] = 10.000000 
d_buffer[1][9] = 11.000000 
d_buffer[1][5] = 12.000000

Any ideas on what might be going wrong with the texture cache version?

Downloads:

My guess is that part of the problem lies in cudaMallocPitch vs cudaMallocArray. In the old texture cache API, cudaMallocArray was the typical thing to use. But, cudaMallocArray expects a cudaChannelFormatDesc, which appears to be obsolete in the new cudaTextureObject_t interface. — solvingPuzzles
– solvingPuzzles, Commented May 5, 2013 at 3:08

Mohsen · Accepted Answer · 2019-02-12 22:24:42Z

4

Your cudaChannelFormatDesc in resDesc.res.pitch2D.desc is wrong: y should be 0.

To set the FormatDesc right use CreateChannelDesc<>() functions like resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>(); instead of setting it manually.

resDesc.res.pitch2D.desc.y = 32 would be valid for a float2 texture.

edited Feb 12, 2019 at 22:24

Mohsen

1831 silver badge11 bronze badges

answered May 5, 2013 at 17:05

kunzmi

1,0241 gold badge6 silver badges8 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mohsen · Accepted Answer · 2019-02-26 13:39:42Z

0

Except cudaChannelFormatDesc, seems you have one logical problem in your code which is not a big deal, but can be very misleading if you are not cautious. If you want to follow the CUDA thread organization into blocks and grids and the way wraps are scheduled (moreover, if you want your code be consistence with C++ concept of being "row major"), better to consider x as the fastest varying dimension (similar to row major). Since your code shows that y is varying faster that x, more proper way would be switching the indexes in your code:

float x = tex2D<float>(tex, tidx, tidy);
printf("tex2D<float>(tex, %d, %d) = %f \n", tidx, tidy, x);
...
printf("d_buffer[%d][%d] = %f \n", tidx, tidy, x);

Worth to mention once again, it's not a big problem, but meanwhile can be very confusing, specially when you want to integrate this kernel with other parts of your code.

answered Feb 26, 2019 at 13:39

Mohsen

1831 silver badge11 bronze badges

1 Comment

Mohsen Over a year ago

It's a bad manner to DWON VOTE without any reason!! At least you can leave a comment to indicate the error. In this way people will get to know why something is wrong (Knowing something is wrong is part of a science).

Collectives™ on Stack Overflow

New CUDA Texture Object -- getting wrong data in 2D case

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related