I've got a RWStructuredBuffer<float> tau that I'm writing to in a ray generation shader rayGen. The DispatchRays dimensions are (tauWidth, tauHeight, 1), and tau has exactly tauWidth * tauHeight elements; each invocation of rayGen writes to a unique element. Now, I also have a compute shader:
[numthreads(16, 16, 1)]
void resolve(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    execute(dispatchThreadId.xy);
}
rayGen and resolve are executed in order every frame.
execute needs to know the sum tau_sum of all elements in tau. execute is invoked with dimensions (frameWidth, frameHeight, 1) and writes to a RWTexture2D out which has dimensions (frameWidth, frameHeight).
I need to know the fastest possible way to make tau_sum available inside execute. I did try to find this information on the web, but I was confused by the possible solutions involving multiple reduction steps and/or reading back to the CPU to do (part of) the accumulation.
EDIT: This is what I have right now. My HLSL (actually Slang) file is:
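RWStructuredBuffer<float> tau;
cbuffer reduceTauCB {
    const uint stride,
               size;
}
// One reduction step: each thread folds the upper half of its
// stride-sized window into the lower half.
[numthreads(256, 1, 1)]
void reduceTau(const uint3 dispatchThreadID: SV_DispatchThreadID) {
    const uint i = dispatchThreadID.x;
    tau[stride * i] = tau[stride * i] + tau[stride * i + stride / 2];
}
// Serial finish: a single thread adds up the remaining size partial sums,
// which sit stride elements apart, into tau[0].
[numthreads(1, 1, 1)]
void reduceTauFinalize(const uint3 dispatchThreadID: SV_DispatchThreadID)
{
    for (uint i = 1; i < size; ++i)
        tau[0] += tau[i * stride];
}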
and this is the C++ side (using Falcor):
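static std::uint32_t constexpr groupSize = 16 * 16; // matches numthreads(256, 1, 1) in reduceTau
std::uint32_t const tauCount = mStaticParams.tauWidth * mStaticParams.tauHeight;
auto reduceTauPassRootVar = mpReduceTauPass->getRootVar()["reduceTauCB"];
std::uint32_t stride = 2,
              threadCount = tauCount / 2;
// Each pass folds tau[stride * i + stride / 2] into tau[stride * i].
while (threadCount >= groupSize)
{
    reduceTauPassRootVar["stride"] = stride;
    mpReduceTauPass->execute(pRenderContext, threadCount, 1, 1);
    stride *= 2;
    threadCount /= 2;
}
// After the loop the partial sums sit stride / 2 apart and there are threadCount * 2 of them.
// The finalize pass has its own variables, so they are set on its own root var
// (assuming it declares the same reduceTauCB).
auto reduceTauFinalizeRootVar = mpReduceTauFinalizePass->getRootVar()["reduceTauCB"];
reduceTauFinalizeRootVar["stride"] = stride / 2;
reduceTauFinalizeRootVar["size"] = threadCount * 2;
mpReduceTauFinalizePass->execute(pRenderContext, 1, 1, 1);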
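To sanity-check the stride and size arithmetic without a GPU, the pass sequence can be modelled on the CPU. The following is only a sketch: reducePass and reduceOnCpu are hypothetical helper names, the element count is assumed to be a power of two, and each "thread" is just a loop iteration. The serial finish walks the partial sums where they sit after the last parallel pass, which is stride / 2 apart, with threadCount * 2 of them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU model of one reduceTau dispatch: "thread" i folds the upper half of its
// stride-sized window into the lower half. (Hypothetical helper name.)
void reducePass(std::vector<float>& tau, std::uint32_t stride, std::uint32_t threadCount)
{
    for (std::uint32_t i = 0; i < threadCount; ++i)
        tau[stride * i] += tau[stride * i + stride / 2];
}

// CPU model of the whole pass sequence. Assumes tau.size() is a power of two;
// that is an assumption of this sketch, not something the shaders enforce.
float reduceOnCpu(std::vector<float> tau, std::uint32_t groupSize = 256)
{
    const std::uint32_t count = static_cast<std::uint32_t>(tau.size());
    assert(count != 0 && (count & (count - 1)) == 0);

    std::uint32_t stride = 2;
    std::uint32_t threadCount = count / 2;
    while (threadCount >= groupSize)
    {
        reducePass(tau, stride, threadCount);
        stride *= 2;
        threadCount /= 2;
    }

    // The last pass that ran used stride / 2, so the partial sums are spaced
    // stride / 2 apart and there are threadCount * 2 of them.
    const std::uint32_t finalStride = stride / 2;
    const std::uint32_t finalSize = threadCount * 2;
    for (std::uint32_t i = 1; i < finalSize; ++i)
        tau[0] += tau[i * finalStride];
    return tau[0];
}
```

Comparing reduceOnCpu against a plain loop over small inputs (for example with groupSize set to 2 so that several parallel passes run) is a cheap way to catch off-by-one-level mistakes in the finalize parameters before debugging on the GPU.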
I wonder if this is really the most performant way to do this. And if it is, is the cbuffer really necessary? Or is there a way to pass my args to the compute shaders directly?
Use an AppendStructuredBuffer. It'll reduce the complexity a bit, and you can get the number of elements directly inside the shader by using GetDimensions.