
I've got a RWStructuredBuffer<float> tau that I'm writing to in a ray generation shader rayGen. The DispatchRays dimensions are (tauWidth, tauHeight, 1) and tau has exactly tauWidth * tauHeight elements; each invocation of rayGen writes to a unique element. Now, I also have a compute shader

[numthreads(16, 16, 1)]
void resolve(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    execute(dispatchThreadId.xy);
}

rayGen and resolve are executed in order every frame. execute needs to know the sum tau_sum of all elements in tau. execute is invoked with dimensions (frameWidth, frameHeight, 1) and writes to a RWTexture2D out which has dimensions (frameWidth, frameHeight).

I need to know the fastest way possible to have tau_sum available inside execute. I did try to find this information on the web, but I was confused by the possible solutions involving multiple reduction steps and/or reading back to the CPU to do (part of) the accumulation.


EDIT: This is what I've got right now. My HLSL (actually Slang) file is:

RWStructuredBuffer<float> tau;

cbuffer reduceTauCB {
    uint stride;
    uint size;
}

[numthreads(256, 1, 1)]
void reduceTau(const uint3 dispatchThreadID: SV_DispatchThreadID) {
    const uint i = dispatchThreadID.x;
    tau[stride * i] = tau[stride * i] + tau[stride * i + stride / 2];
}

[numthreads(1, 1, 1)]
void reduceTauFinalize(const uint3 dispatchThreadID: SV_DispatchThreadID)
{
    for (uint i = 1; i < size; ++i)
        tau[0] += tau[i * stride];
}

and this is the C++ side (using Falcor):

static std::uint32_t constexpr groupSize = 16 * 16;
std::uint32_t const tauCount = mStaticParams.tauWidth * mStaticParams.tauHeight;

auto reduceTauPassRootVar = mpReduceTauPass->getRootVar()["reduceTauCB"];

std::uint32_t stride = 2,
    threadCount = tauCount / 2;
while (threadCount >= groupSize)
{
    reduceTauPassRootVar["stride"] = stride;
    mpReduceTauPass->execute(pRenderContext, threadCount, 1, 1);

    stride *= 2;
    threadCount /= 2;
}

// The loop has already advanced stride/threadCount one step past the last
// executed pass, so the finalize pass needs the values from that last pass.
// Note the root var must come from the finalize pass, not the reduce pass.
auto finalizeRootVar = mpReduceTauFinalizePass->getRootVar()["reduceTauCB"];
finalizeRootVar["stride"] = stride / 2;
finalizeRootVar["size"] = threadCount * 2;
mpReduceTauFinalizePass->execute(pRenderContext, 1, 1, 1);
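To sanity-check the stride schedule, here is a hypothetical CPU mock of the dispatch loop above (the function name and the power-of-two assumption on tauCount are mine, not from the original code). Note that the finalize step must use the stride/size of the last executed pass, since the loop advances both one step too far:

```cpp
#include <cstdint>
#include <vector>

// CPU mock of the multi-dispatch reduction schedule.
// Assumes tau.size() is a power of two, as the stride / 2 math requires.
float reduceTauOnCpu(std::vector<float> tau)
{
    const std::uint32_t groupSize = 256;
    std::uint32_t stride = 2;
    std::uint32_t threadCount = static_cast<std::uint32_t>(tau.size()) / 2;

    // Each "dispatch": thread i folds tau[stride*i + stride/2] into tau[stride*i].
    while (threadCount >= groupSize)
    {
        for (std::uint32_t i = 0; i < threadCount; ++i)
            tau[stride * i] += tau[stride * i + stride / 2];
        stride *= 2;
        threadCount /= 2;
    }

    // Finalize with the stride/size of the last executed pass.
    stride /= 2;
    threadCount *= 2;
    for (std::uint32_t i = 1; i < threadCount; ++i)
        tau[0] += tau[i * stride];
    return tau[0];
}
```

Running this on a buffer of 1024 ones returns 1024, which confirms the schedule covers every element exactly once.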

I wonder if this is really the most performant way to do this. And if it is, is the cbuffer really necessary? Or is there a way to pass my arguments to the compute shaders directly?

  • Maybe you should use an AppendStructuredBuffer. It'll reduce the complexity a bit and you can directly get the number of elements inside the shader by using GetDimensions. Commented May 30 at 6:42

2 Answers


Seems like you are searching for "parallel reduction". Here is a recent post on this topic: Parallel reduction with single wave.
The best way is to calculate this sum per workgroup, because you can control synchronization within a workgroup from inside the shader.

It will look like this: compute the sum of each pair of elements and store it in one element of the pair, then sum each pair of the results from the previous step. You repeat like this until you get the final single value.
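The scheme above can be sketched on the CPU; this is a minimal illustration (function name mine), assuming a power-of-two element count. It uses the GPU-friendly "sequential addressing" variant where each round folds the upper half onto the lower half, which is what a shader would do in groupshared memory with a barrier between rounds:

```cpp
#include <cstddef>
#include <vector>

// Tree reduction: each round halves the number of active partial sums.
// Assumes data.size() is a power of two.
float treeReduce(std::vector<float> data)
{
    for (std::size_t active = data.size() / 2; active >= 1; active /= 2)
        for (std::size_t i = 0; i < active; ++i)
            data[i] += data[i + active];
    return data[0];
}
```

For n elements this takes log2(n) rounds instead of n - 1 serial additions, which is the whole point of doing it in parallel.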

GPU sorting algorithms are built in a similar way, but are more complex; maybe you would like to take a look: https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting


3 Comments

Thank you for your answer. But in what sense is this different from the approach I described in the question?
It's different because you do multiple dispatches; with synchronization inside the shader you do it in one dispatch. It wouldn't matter that much if everything were perfectly parallel, but it almost never is, so you would probably prefer synchronizing inside a workgroup with one dispatch over synchronizing the whole thing across multiple dispatches.
As an addition I will also say that you probably want multiple thread groups, and if so, that matters even more, because you will need to wait until all of them finish. They are usually not all scheduled at once, so you will just stall until every group finishes and the dispatch ends, which is a waste of time. But if one workgroup of 256 really is the configuration you will use, it's better to switch to CPU-side computation; it will be a lot faster than transferring the data to a buffer, dispatching, and reading back. The CPU is fast for such small counts.
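For what it's worth, the CPU-side fallback for a few hundred values really is a one-liner (hypothetical helper name, just to illustrate the point):

```cpp
#include <numeric>
#include <vector>

// Sum a small tau buffer directly on the CPU; for a few hundred floats
// this beats a GPU readback + dispatch round-trip.
float sumTauOnCpu(const std::vector<float>& tau)
{
    return std::accumulate(tau.begin(), tau.end(), 0.0f);
}
```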

The optimal method depends on your tauWidth / tauHeight numbers. When the product is not too large, it’s often best to do the complete reduction in a single dispatch. You probably want the maximum count of threads per group in that shader; the limit is 1024.

Here’s a compute shader which efficiently computes the sum of all the floats in a single pass; dispatch a single thread group of it.

StructuredBuffer<float> source: register(t0);
RWStructuredBuffer<float> dest: register(u0);

cbuffer reduceTauCB
{
    // Pass tauWidth*tauHeight in this number
    uint length: packoffset(c0.x);
}

// 1024 is the API limit on FL11.0, D3D11_CS_THREAD_GROUP_MAX_THREADS_PER_GROUP  
static const uint THREADS = 1024;

// Group shared buffer for the reduction
groupshared float localBuffer[ THREADS ];

[numthreads( THREADS, 1, 1 )]
void main( uint thread: SV_GroupIndex )
{
    float acc = 0;
    // The loop below makes sure all memory loads are fully coalesced
    for( uint rsi = thread; rsi < length; rsi += THREADS )
        acc += source[ rsi ];
    
    localBuffer[ thread ] = acc;
    GroupMemoryBarrierWithGroupSync();
    
    [loop]
    for( uint i = ( THREADS / 2 ); i > 1; i /= 2u )
    {
        [branch]        
        if( thread < i )
        {
            acc += localBuffer[ thread + i ];
            localBuffer[ thread ] = acc;
        }
        GroupMemoryBarrierWithGroupSync();
    }
    
    [branch]        
    if( 0 == thread )
    {
        acc += localBuffer[ 1 ];
        dest[ 0 ] = acc;
    }
}

The first loop accumulates into 1024 local variables, one per thread, and is written so that the memory loads are fully coalesced, i.e. threads inside a wavefront load from sequential addresses in global memory. The second loop reduces these 1024 numbers into a single one using a buffer in the fast on-chip memory. BTW, local variables are even faster than groupshared memory, which is why the first loop only updates the local variable.
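The two phases can be modeled serially on the CPU; this is a hypothetical sketch (function name mine) that mirrors the shader's structure, useful for convincing yourself the indexing is right:

```cpp
#include <cstdint>
#include <vector>

static const std::uint32_t THREADS = 1024;

// CPU model of the single-dispatch shader: phase 1 gives each of THREADS
// "threads" a strided slice of the input, phase 2 tree-reduces the partials.
float reduceSingleDispatch(const std::vector<float>& source)
{
    const std::uint32_t length = static_cast<std::uint32_t>(source.size());
    std::vector<float> localBuffer(THREADS, 0.0f);

    // First loop: thread t accumulates source[t], source[t + THREADS], ...
    // (on the GPU these loads are coalesced across the wavefront).
    for (std::uint32_t thread = 0; thread < THREADS; ++thread)
        for (std::uint32_t rsi = thread; rsi < length; rsi += THREADS)
            localBuffer[thread] += source[rsi];

    // Second loop: the same tree reduction as the shader's groupshared pass;
    // it stops at i == 2, leaving the last pair for the final addition.
    for (std::uint32_t i = THREADS / 2; i > 1; i /= 2)
        for (std::uint32_t thread = 0; thread < i; ++thread)
            localBuffer[thread] += localBuffer[thread + i];

    return localBuffer[0] + localBuffer[1];
}
```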

If your array is large, a two-pass version works better. An easy way for 2D arrays: allocate a temporary buffer of tauHeight floats and dispatch tauHeight thread groups of a compute shader with 64 threads per group. Here’s the first pass of the reduction; the second pass is identical to the single-pass shader.

StructuredBuffer<float> source: register(t0);
RWStructuredBuffer<float> dest: register(u0);

cbuffer reduceTauCB
{
    uint width: packoffset(c0.x);
}

// Change to 32 if you don't care about AMD older than RDNA
static const uint THREADS = 64;

// Group shared buffer for the reduction
groupshared float localBuffer[ THREADS ];

[numthreads( THREADS, 1, 1 )]
void main( uint3 group: SV_GroupID, uint thread: SV_GroupIndex )
{
    float acc = 0;
    uint rsi = group.x * width;
    const uint rsiEnd = rsi + width;
    for( rsi += thread; rsi < rsiEnd; rsi += THREADS )
        acc += source[ rsi ];
    
    localBuffer[ thread ] = acc;
    GroupMemoryBarrierWithGroupSync();
    
    [loop]
    for( uint i = ( THREADS / 2 ); i > 1; i /= 2u )
    {
        [branch]        
        if( thread < i )
        {
            acc += localBuffer[ thread + i ];
            localBuffer[ thread ] = acc;
        }
        GroupMemoryBarrierWithGroupSync();
    }
    
    [branch]        
    if( 0 == thread )
    {
        acc += localBuffer[ 1 ];
        dest[ group.x ] = acc;
    }
}
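Serially, the two-pass scheme amounts to the following CPU sketch (a hypothetical helper of mine, assuming source holds height rows of width floats): pass 1 produces one partial sum per row, one thread group per row, and pass 2 reduces the tauHeight partials exactly like the single-pass shader.

```cpp
#include <cstdint>
#include <vector>

// CPU model of the two-pass reduction over a width x height 2D array.
float reduceTwoPass(const std::vector<float>& source,
                    std::uint32_t width, std::uint32_t height)
{
    // Pass 1: dest[group] = sum of row `group` (one thread group per row).
    std::vector<float> dest(height, 0.0f);
    for (std::uint32_t group = 0; group < height; ++group)
        for (std::uint32_t x = 0; x < width; ++x)
            dest[group] += source[group * width + x];

    // Pass 2: reduce the per-row partials (on the GPU this is a single
    // thread group of the single-pass shader).
    float total = 0.0f;
    for (float rowSum : dest)
        total += rowSum;
    return total;
}
```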
