Prefetch in cuda (through C code)

Question

I am working on data prefetch in CUDA (Fermi GPU) through C code. Cuda reference manual talks about the prefetching at ptx level code not at C level code.

Can anyone connect me with some documents or something regarding prefetching through cuda code (cu file). Any help would be appreciated.

lashgar · Accepted Answer · 2012-11-07 11:55:07Z

9

According to PTX manual here is how prefetch works in PTX:

enter image description here

You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:

__device__ int cube (int x)
{
  int y;
  asm("{\n\t"                       // use braces for local scope
      " .reg .u32 t1;\n\t"           // temp reg t1,
      " mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
      " mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
      "}"
      : "=r"(y) : "r" (x));
  return y;
}

You may come to conclude with the following prefetch function in C:

__device__ void prefetch_l1 (unsigned int addr)
{

  asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}

NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20

edited Nov 7, 2012 at 11:55

answered Nov 7, 2012 at 10:35

lashgar

5,5984 gold badges39 silver badges49 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Fady Kamal Over a year ago

can you provide me on some more documentation on how prefetching works, like the explanation of the concept itself.

lashgar Over a year ago

sure! check this GPGPU prefetching study and go through references to figure out more about the concept: cc.gatech.edu/~hyesoon/lee_taco12.pdf

Serge Rogatch · Accepted Answer · 2019-07-28 11:50:59Z

3

According to this thread, below is the code for different cache prefetching techniques:

#define DEVICE_STATIC_INTRINSIC_QUALIFIERS  static __device__ __forceinline__

#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR   "l"
#else
#define PXL_GLOBAL_PTR   "r"
#endif

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l1(const void* const ptr)
{
  asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_uniform(const void* const ptr)
{
  asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

DEVICE_STATIC_INTRINSIC_QUALIFIERS void __prefetch_global_l2(const void* const ptr)
{
  asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

answered Jul 28, 2019 at 11:50

Serge Rogatch

15.3k9 gold badges101 silver badges186 bronze badges

2 Comments

tera Over a year ago

Good lifting. Now we just need an example where these actually provide a benefit.

Yashas Over a year ago

@tera I have a general rule of thumb: if Nisght Compute lists long scoreboard stalls as the top stall contributer, you're mostly going to benefit from prefetches. This rule of thumb has worked in 9/10 cases for me. It's very important for low occupancy kernels (like when you are limited to a warp or a block).

Collectives™ on Stack Overflow

Prefetch in cuda (through C code)

2 Answers 2

2 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related