10

I am working on data prefetch in CUDA (Fermi GPU) through C code. Cuda reference manual talks about the prefetching at ptx level code not at C level code.

Can anyone connect me with some documents or something regarding prefetching through cuda code (cu file). Any help would be appreciated.

2 Answers 2

9

According to PTX manual here is how prefetch works in PTX:

enter image description here

You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:

__device__ int cube (int x)
{
  int y;
  asm("{\n\t"                       // use braces for local scope
      " .reg .u32 t1;\n\t"           // temp reg t1,
      " mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
      " mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
      "}"
      : "=r"(y) : "r" (x));
  return y;
}

You may come to conclude with the following prefetch function in C:

__device__ void prefetch_l1 (unsigned int addr)
{

  asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}

NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20

Sign up to request clarification or add additional context in comments.

2 Comments

can you provide me on some more documentation on how prefetching works, like the explanation of the concept itself.
sure! check this GPGPU prefetching study and go through references to figure out more about the concept: cc.gatech.edu/~hyesoon/lee_taco12.pdf
3

According to this thread, below is the code for different cache prefetching techniques:

#define DEVICE_STATIC_INTRINSIC_QUALIFIERS  static __device__ __forceinline__

#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR   "l"
#else
#define PXL_GLOBAL_PTR   "r"
#endif

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l1(const void* const ptr)
{
  asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_uniform(const void* const ptr)
{
  asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l2(const void* const ptr)
{
  asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

2 Comments

Good lifting. Now we just need an example where these actually provide a benefit.
@tera I have a general rule of thumb: if Nisght Compute lists long scoreboard stalls as the top stall contributer, you're mostly going to benefit from prefetches. This rule of thumb has worked in 9/10 cases for me. It's very important for low occupancy kernels (like when you are limited to a warp or a block).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.