Parallel reduction with single wave

Question

Following these slides (https://www.nvidia.com/content/GTC-2010/pdfs/2260_GTC2010.pdf), when doing a parallel reduction you should avoid GroupMemoryBarrierWithGroupSync when reducing the last 2*WaveGetLaneCount() elements:

for (unsigned int s=groupDim_x/2; s>32; s>>=1) 
{ 
    if (tid < s) sdata[tid] += sdata[tid + s]; 
    GroupMemoryBarrierWithGroupSync(); 
}
if (tid < 32)
{ 
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8]; 
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1]; 
}

But the problem with this is that the compiler doesn't write back to shared memory between steps. It essentially does:

if (tid < 32)
{ 
    float ldata = sdata[tid];
    ldata += sdata[tid + 32];
    ldata += sdata[tid + 16];
    ldata += sdata[tid +  8]; 
    ldata += sdata[tid +  4];
    ldata += sdata[tid +  2];
    ldata += sdata[tid +  1];
    sdata[tid] = ldata;
}

https://hlsl.godbolt.org/z/9o4r4nvnc

How can I fix this? One way is to prefix every line with if (tid < s). Is there anything less hacky?

Side note

With Shader Model 6, you should use the wave intrinsics ~~WaveReadLaneAt and WaveGetLaneIndex~~ WaveActiveSum.
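A minimal sketch of what a WaveActiveSum-based group reduction might look like, assuming a 1D thread group and Shader Model 6.0 (cs_6_0). The buffer names gInput and gOutput, the GROUP_SIZE value, and the entry point name are hypothetical. It also assumes that threads are packed into waves in SV_GroupIndex order, which is common in practice but should be checked for the target hardware.

```hlsl
#define GROUP_SIZE 256  // assumed thread-group size

StructuredBuffer<float>   gInput  : register(t0);  // hypothetical input buffer
RWStructuredBuffer<float> gOutput : register(u0);  // hypothetical output, one sum per group

// One partial sum per wave. WaveGetLaneCount() is at least 4,
// so GROUP_SIZE / 4 slots cover the worst case.
groupshared float wavePartials[GROUP_SIZE / 4];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex, uint3 gid : SV_GroupID, uint3 dtid : SV_DispatchThreadID)
{
    // Reduce within the wave; no groupshared memory or barrier is needed for this step.
    float waveSum = WaveActiveSum(gInput[dtid.x]);

    // Assumption: waves are formed from consecutive SV_GroupIndex values.
    uint laneCount = WaveGetLaneCount();
    uint waveId    = tid / laneCount;

    if (WaveIsFirstLane())
        wavePartials[waveId] = waveSum;

    // Make every wave's partial sum visible to the whole group.
    GroupMemoryBarrierWithGroupSync();

    // A single thread adds up the few per-wave partials.
    if (tid == 0)
    {
        uint  waveCount = (GROUP_SIZE + laneCount - 1) / laneCount;
        float total = 0.0f;
        for (uint i = 0; i < waveCount; ++i)
            total += wavePartials[i];
        gOutput[gid.x] = total;
    }
}
```

The only barrier left is the one between writing and reading the per-wave partials, so the unsynchronised tail of the original loop disappears entirely.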

Links for Shader Model 6 parallel reductions: developer.nvidia.com/blog/faster-parallel-reductions-kepler github.com/b0nes164/GPUPrefixSums/blob/main/GPUPrefixSumsD3D12/…
– Tom Huntington, Commented Apr 4 at 22:32

Bizzarrus · Accepted Answer · 2025-04-21 23:16:37Z

2

The simple but sad answer is that this GTC talk is outdated and relied on undefined behaviour.

The HLSL specs state:

threadgroup memory is denoted in hlsl with the groupshared keyword. [...] Reads and writes to threadgroup Memory, may occur in any order except as restricted by synchronization intrinsics or other memory annotations.

( https://microsoft.github.io/hlsl-specs/specs/hlsl.html )

So without synchronisation through group barriers or other memory annotations, the compiler is free to keep the running sum in a register and skip the intermediate writes to groupshared memory. Older dxc versions might not have done this. However, even if dxc doesn't perform this kind of optimisation, the driver might still do it when creating the pipeline (the AMD driver in particular likes to do this).
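Going by the spec wording quoted above, one way to stay within defined behaviour without wave intrinsics is to keep the barrier on every step, including the last ones. A minimal sketch, assuming a power-of-two group size; the buffer names, GROUP_SIZE, and the entry point name are hypothetical and not from the question:

```hlsl
#define GROUP_SIZE 256  // assumed thread-group size (power of two)

StructuredBuffer<float>   gInput  : register(t0);  // hypothetical input buffer
RWStructuredBuffer<float> gOutput : register(u0);  // hypothetical output, one sum per group

groupshared float sdata[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint tid : SV_GroupIndex, uint3 gid : SV_GroupID, uint3 dtid : SV_DispatchThreadID)
{
    sdata[tid] = gInput[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // The loop runs all the way down to s == 1, with a barrier after every step,
    // so each read of sdata[tid + s] is ordered after the write that produced it.
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        GroupMemoryBarrierWithGroupSync();  // reached by all threads: uniform control flow
    }

    if (tid == 0)
        gOutput[gid.x] = sdata[0];
}
```

This gives up the barrier-free tail from the slides in exchange for relying only on ordering that the barriers guarantee.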


2 Comments

Tom Huntington Apr 21 at 23:34

Do you know any good resources for parallel programming with warp/wave intrinsics?

Bizzarrus Apr 22 at 1:43

@TomHuntington No, sorry, most of my knowledge comes from reading docs/specs and searching for implementations or examples of specific algorithms on Google. The official DirectX Discord Server can also be very helpful; they have a dedicated channel for HLSL programming there to ask questions and get help.
