According to these slides, https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf, a parallel reduction should avoid GroupMemoryBarrierWithGroupSync when reducing the last 2*WaveGetLaneCount() elements:

    for (unsigned int s = groupDim_x / 2; s > 32; s >>= 1)
    {
        if (tid < s) sdata[tid] += sdata[tid + s];
        GroupMemoryBarrierWithGroupSync();
    }
    if (tid < 32)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

The problem is that the compiler doesn't write the intermediate sums back to shared memory. It essentially generates this:

    if (tid < 32)
    {
        float ldata = sdata[tid];
        ldata += sdata[tid + 32];
        ldata += sdata[tid + 16];
        ldata += sdata[tid + 8];
        ldata += sdata[tid + 4];
        ldata += sdata[tid + 2];
        ldata += sdata[tid + 1];
        sdata[tid] = ldata;
    }

https://hlsl.godbolt.org/z/9o4r4nvnc
How can I fix this? One way is to prefix every step with its own if (tid < s) check. Is there anything less hacky?
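
For reference, the per-step guard mentioned above might look like the sketch below. GROUP_SIZE and ReduceLast64 are hypothetical names introduced only for illustration, and whether the compiler then keeps every intermediate store is something to confirm in the generated code rather than assume.

```hlsl
// Hypothetical group size; sdata mirrors the shared array used in the question.
#define GROUP_SIZE 256
groupshared float sdata[GROUP_SIZE];

// Hypothetical helper: reduces sdata[0..63] into sdata[0].
// Each step carries its own guard, so only the threads whose
// result is still needed read and write shared memory.
void ReduceLast64(uint tid)
{
    if (tid < 32) sdata[tid] += sdata[tid + 32];
    if (tid < 16) sdata[tid] += sdata[tid + 16];
    if (tid <  8) sdata[tid] += sdata[tid +  8];
    if (tid <  4) sdata[tid] += sdata[tid +  4];
    if (tid <  2) sdata[tid] += sdata[tid +  2];
    if (tid <  1) sdata[tid] += sdata[tid +  1];
}
```

Like the unrolled version, this has no barrier between steps, so it still relies on the participating threads running in lockstep within one wave.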
Side note
With Shader Model 6, you should use the wave intrinsics instead: WaveReadLaneAt with WaveGetLaneIndex, or simply WaveActiveSum.
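
A minimal sketch of the WaveActiveSum route, assuming Shader Model 6.0 or later. GROUP_SIZE, gInput, gOutput, waveSums and CSMain are hypothetical names, and the sketch assumes that lanes are assigned to waves in linear SV_GroupIndex order, which is common for one-dimensional groups but should be checked for the target hardware.

```hlsl
// Hypothetical group size and resource bindings.
#define GROUP_SIZE 256

StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

// One slot per wave. Waves have at least 4 lanes, so GROUP_SIZE / 4 slots suffice.
groupshared float waveSums[GROUP_SIZE / 4];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex, uint3 gid : SV_GroupID)
{
    float v = gInput[gid.x * GROUP_SIZE + tid];

    // Sum over all active lanes of this wave; no shared memory involved.
    float waveSum = WaveActiveSum(v);

    // ASSUMPTION: wave w covers group indices [w * laneCount, (w + 1) * laneCount).
    uint laneCount = WaveGetLaneCount();
    uint waveId    = tid / laneCount;

    // One lane per wave publishes the partial sum.
    if (WaveIsFirstLane())
        waveSums[waveId] = waveSum;

    GroupMemoryBarrierWithGroupSync();

    // A single thread adds the per-wave partial sums.
    if (tid == 0)
    {
        uint  waveCount = (GROUP_SIZE + laneCount - 1) / laneCount;
        float total     = 0.0f;
        for (uint i = 0; i < waveCount; ++i)
            total += waveSums[i];
        gOutput[gid.x] = total;
    }
}
```

This needs only one group-wide barrier and never depends on the compiler writing intermediate sums back to shared memory.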