This is a bit of a slog so bear with me.
I'm currently writing a 3D S(moothed) P(article) H(ydrodynamics) simulation in Unity with a parallel HLSL backend. SPH is a Lagrangian method of fluid simulation: a set of particles discretises a continuous field of various properties, i.e. each property is evaluated at the particle positions, and these values are smoothed with a kernel in order to reconstruct the field.
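For the unfamiliar, the standard SPH interpolant for a field quantity $A$ (with $m$ mass, $\rho$ density, $W$ the smoothing kernel and $h$ the smoothing length) is:

```latex
A(\mathbf{x}) \approx \sum_j \frac{m_j}{\rho_j} \, A_j \, W(\mathbf{x} - \mathbf{x}_j, h)
```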
Everything is working great, and I'm now working on optimisation to get the particle count up. Specifically, I've implemented this paper on the topic. The paper gives an efficient spatial partitioning algorithm that scans/scatters particles in parallel so that particles in the same grid cell are stored contiguously in the position buffer:
- Discretise the particle position to its grid cell; e.g. in 3D: (1, 1, 1)
- Hash this grid cell to its table index (using the hash function given in the paper); e.g. (1, 1, 1) -> 42
- Scatter particles into a neighbourhood-aware ordering, e.g.:
  - 50 particles p_i are in some cell and are hashed to index 42
  - The `Particles` buffer is re-ordered so that at some index, the 50 particles are stored contiguously as `[..., p_0, ..., p_i, ..., p_49, ...]`
- Then when we are doing a neighbourhood search for particles within the kernel radius of some particle `a`, we just need to iterate over particles between `startIndex` and `endIndex` for each of the 27 adjacent cells of whatever cell particle `a` is in
  - Spatial discretisation is done in cells of size equal to the kernel radius, so only immediately adjacent cells need to be checked
- The `startIndex` and `endIndex` of each bin are calculated during the partition/scan/scatter pipeline and stored in a buffer `Offsets` such that `startIndex = Offsets[hash]` and `endIndex = Offsets[hash + 1]`, where `hash` is the hash of a grid cell
  - E.g. for cell (1, 1, 1): `startIndex = Offsets[hash(1,1,1)] = 42` and `endIndex = Offsets[hash(1,1,1) + 1] = 92`
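To make the pipeline above concrete, here's the same logic as an illustrative serial Python sketch (not my HLSL — a counting sort stands in for the paper's parallel scan, and the Teschner-style hash constants are my assumption, not necessarily the paper's hash):

```python
# Illustrative sketch of the partition/scan/scatter pipeline.
# Serial Python with a counting sort standing in for the parallel scan.

def grid_pos(p, cell_size):
    # Discretise a position to its grid cell, e.g. (1, 1, 1)
    return tuple(int(c // cell_size) for c in p)

def cell_hash(cell, table_size):
    # Teschner-style spatial hash (assumed constants, for illustration only)
    x, y, z = cell
    return ((x * 73856093) ^ (y * 19349663) ^ (z * 83492791)) % table_size

def build_offsets(positions, cell_size, table_size):
    hashes = [cell_hash(grid_pos(p, cell_size), table_size) for p in positions]
    # Count particles per bin...
    counts = [0] * (table_size + 1)
    for h in hashes:
        counts[h + 1] += 1
    # ...then prefix-sum into Offsets: bin h spans [Offsets[h], Offsets[h + 1])
    offsets = counts[:]
    for h in range(table_size):
        offsets[h + 1] += offsets[h]
    # Scatter: re-order so that same-cell particles are contiguous
    cursor = offsets[:]
    ordered = [None] * len(positions)
    for i, h in enumerate(hashes):
        ordered[cursor[h]] = positions[i]
        cursor[h] += 1
    return ordered, offsets
```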
So that's the method, and it is working: the Position and Velocity buffers are ordered this way and mirror each other. Buffers storing other properties (e.g. Density) are derived from these ordered buffers, so ordering/contiguousness is preserved.
For context, here's a code snippet of an example neighbourhood search:
```hlsl
void CalculatePCISPHComponents(uint i, out float3 viscosity, out float3 surfaceTension, out float3 pressure, out float3 xsph) {
    viscosity = 0;
    surfaceTension = 0;
    pressure = 0;
    xsph = 0;
    float3 posI = Positions[i];
    int3 gridPosI = GetGridPos(posI);
    float3 velI = Velocities[i];
    for (int x = -1; x < 2; x++) {
        for (int y = -1; y < 2; y++) {
            for (int z = -1; z < 2; z++) {
                int3 gridPosJ = gridPosI + int3(x, y, z);
                if (!IsInBounds(gridPosJ)) continue;
                uint hash = CalculateHashFromGrid(gridPosJ);
                uint startIndex = Offsets[hash];
                uint endIndex = Offsets[hash + 1];
                for (uint j = startIndex; j < endIndex; j++) {
                    if (i == j) continue;
                    float3 posOffset = posI - Positions[j];
                    float rSq = dot(posOffset, posOffset);
                    if (rSq < Epsilon * Epsilon || rSq > 4 * smoothingRadius * smoothingRadius) continue;
                    float r = sqrt(rSq);
                    float3 velOffset = velI - Velocities[j];
                    float kernel = CubicSplineKernel(r);
                    float3 grad = CubicSplineGrad(posOffset, r);
                    viscosity += CalculateViscosityContribution(posOffset, velOffset, grad, i, j);
                    surfaceTension += -surfaceTensionMultiplier * kernel * (posOffset / r);
                    pressure += CalculatePressureContribution(posOffset, grad, i, j);
                    float massOverDensity = particleMass / Densities[j];
                    xsph += velocitySmoothing * massOverDensity * kernel * -velOffset;
                }
            }
        }
    }
}
```
This function is called per particle (one thread per particle) in a dispatch like this (`CalculatePCISPHAcceleration` just sums the forces spat out by the function above):
```hlsl
[numthreads(NumThreads, 1, 1)]
void CalculateNonPressureAcceleration(uint3 id : SV_DispatchThreadID) {
    if (id.x >= instanceCount) return;
    IntermediateAccelerations[id.x] = CalculatePCISPHAcceleration(id.x);
}
```
For each of the 27 adjacent-cell searches I'm iterating over a contiguous region of the relevant buffers, but I'm not seeing the results I would expect: increasing the particle count beyond ~15K tanks performance, and I've seen implementations run hundreds of thousands of particles using similar memory access patterns. I suspect I'm missing something or have a fundamental misunderstanding of GPU memory.
I've tried:
- Using a cell-based dispatch pattern which copies neighbourhood particles to shared memory before dispatching threads
- Changing hash functions (I tried Morton codes and the 3D linearisation used in the implementation linked above)
- Profiling a full frame capture, which eventually attributes 16.5% of the compute time for one of the kernels to this line of compiled code:
  `u_xlat21.xyz = u_xlat1.xyz + (-u_xlat21.xyz);`
  which corresponds to the shader line
  `float3 posOffset = posI - Positions[j];`
  This is obviously a trivial calculation, leading me to believe that the memory access pattern is indeed the bottleneck
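For clarity, by Morton hashing I mean bit-interleaving the cell coordinates so that spatially nearby cells tend to land near each other in memory; an illustrative Python version (not my HLSL) using the well-known bit-spreading trick:

```python
# Illustrative 3D Morton (Z-order) encoding of a grid cell.

def part1by2(n):
    # Spread the low 10 bits of n so there are two zero bits between each bit.
    n &= 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def morton3(x, y, z):
    # x occupies bits 0, 3, 6, ...; y bits 1, 4, 7, ...; z bits 2, 5, 8, ...
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)
```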
Some details:
- Running on an M4 Metal 10-core GPU
- 256 threads per thread group (i.e. `ceil(instanceCount / 256)` thread groups per kernel dispatch)
- 8192 bins
I appreciate that looking for a silver bullet here is unlikely to pay off, but I'd welcome any insights/suspicions/encouragements of things to try! I hope I've given enough context, but the codebase is here if curious.