According to these slides, https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf, a parallel reduction should avoid GroupMemoryBarrierWithGroupSync when reducing the last 2*WaveGetLaneCount() elements (the code below hardcodes a wave size of 32):

for (unsigned int s=groupDim_x/2; s>32; s>>=1) 
{ 
    if (tid < s) sdata[tid] += sdata[tid + s]; 
    GroupMemoryBarrierWithGroupSync(); 
}
if (tid < 32)
{ 
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8]; 
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1]; 
}

The problem is that the compiler does not write the intermediate sums back to shared memory, so the other lanes never see them. It essentially generates:

if (tid < 32)
{ 
    float ldata = sdata[tid];
    ldata += sdata[tid + 32];
    ldata += sdata[tid + 16];
    ldata += sdata[tid +  8]; 
    ldata += sdata[tid +  4];
    ldata += sdata[tid +  2];
    ldata += sdata[tid +  1];
    sdata[tid] = ldata;
}

https://hlsl.godbolt.org/z/9o4r4nvnc

How can this be fixed? One way is to prefix every line with if (tid < s). Is there anything less hacky?

Side note

With Shader Model 6, you should use the wave intrinsics WaveReadLaneAt, WaveGetLaneIndex, and WaveActiveSum instead.
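
To illustrate the side note, here is a minimal sketch of a group-wide sum built on WaveActiveSum, assuming Shader Model 6.0 or later. The names GROUP_SIZE, gInput, gOutput, sWaveSums, sWaveCount, and CSMain are hypothetical and not part of the original shader. The sketch also assumes the input length is a multiple of GROUP_SIZE. Each wave claims a slot through an atomic counter, so it does not depend on how threads are packed into waves.

```hlsl
// Sketch only: resource names, group size, and entry point are assumptions.
#define GROUP_SIZE 256

StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

// Upper bound on the number of waves: at most one wave per thread.
groupshared float sWaveSums[GROUP_SIZE];
groupshared uint  sWaveCount;

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  tid  : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Reset the slot counter before any wave claims a slot.
    if (tid == 0)
        sWaveCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Sum across the wave without touching groupshared memory.
    float waveSum = WaveActiveSum(gInput[dtid.x]);

    // One lane per wave publishes the partial sum in a freshly claimed slot.
    if (WaveIsFirstLane())
    {
        uint slot;
        InterlockedAdd(sWaveCount, 1, slot);
        sWaveSums[slot] = waveSum;
    }
    GroupMemoryBarrierWithGroupSync();

    // A single thread combines the per-wave partial sums.
    if (tid == 0)
    {
        float total = 0.0f;
        for (uint w = 0; w < sWaveCount; ++w)
            total += sWaveSums[w];
        gOutput[gid.x] = total;
    }
}
```

Because slots are claimed in whatever order the waves arrive, the floating-point summation order can differ between runs, so results may vary in the last bits.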

1 Answer

The simple but sad answer is that this GTC talk is outdated and relied on undefined behaviour.

The HLSL specs state:

threadgroup memory is denoted in hlsl with the groupshared keyword. [...] Reads and writes to threadgroup Memory, may occur in any order except as restricted by synchronization intrinsics or other memory annotations.

( https://microsoft.github.io/hlsl-specs/specs/hlsl.html )

So without synchronisation through group barriers or other memory annotations, the compiler is free to perform this kind of optimisation. Older dxc versions might not have done so. However, even if dxc does not perform it, the driver may still do so when creating the pipeline (the AMD driver in particular likes to do this).
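
As a minimal sketch of what this implies for the shared-memory version, the reduction below keeps the barrier in every step, including the last ones, instead of unrolling the final wave. The names GROUP_SIZE, gInput, gOutput, and CSMain are hypothetical, and the sketch assumes the input length is a multiple of GROUP_SIZE.

```hlsl
// Sketch only: resource names, group size, and entry point are assumptions.
#define GROUP_SIZE 256

StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

groupshared float sdata[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  tid  : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Load one element per thread into groupshared memory.
    sdata[tid] = gInput[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Halve the active range each step. The barrier sits outside the
    // divergent branch, so every thread in the group reaches it, and each
    // step's writes are visible before the next step reads them.
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 now holds the sum for the whole group.
    if (tid == 0)
        gOutput[gid.x] = sdata[0];
}
```

This costs a few extra barriers compared with the slide's unrolled version, but it stays within the ordering rules quoted above.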

2 Comments

Do you know any good resources for parallel programming with warp/wave intrinsics?
@TomHuntington No, sorry. Most of my knowledge comes from reading docs and specs and from searching Google for implementations or examples of specific algorithms. The official DirectX Discord server can also be very helpful: it has a dedicated channel for HLSL programming where you can ask questions and get help.
