This is a bit of a slog so bear with me.
I'm currently writing a 3D S(moothed) P(article) H(ydrodynamics) simulation in Unity with a parallel HLSL backend. SPH is a Lagrangian method of fluid simulation: a set of particles discretises a continuous field of various properties, i.e. each property is evaluated at the particle positions, and these values are smoothed with a kernel in order to reconstruct the field.
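For the unfamiliar, the standard SPH interpolant for a field quantity $A$ (with $m$ mass, $\rho$ density, $W$ the smoothing kernel and $h$ the smoothing length) is:

```latex
A(\mathbf{x}) \approx \sum_j \frac{m_j}{\rho_j} \, A_j \, W(\mathbf{x} - \mathbf{x}_j, h)
```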
Everything is working great, and I'm now working on optimisation to get the particle count up. Specifically, I've implemented this paper on the topic. The paper gives an efficient spatial partitioning algorithm that scans/scatters particles in parallel so that particles in the same grid cell are stored contiguously in the position buffer:
- Discretise the particle position to its grid cell; e.g. in 3D: (1, 1, 1)
- Hash this grid cell to its table index (using the hash function given in the paper); e.g. (1, 1, 1) -> 42
- Scatter particles into a neighbourhood-aware ordering, e.g.:
  - 50 particles p_i are in some cell and are hashed to index 42
  - The `Particles` buffer is re-ordered so that at some index, the 50 particles are stored contiguously as `[..., p_0, ..., p_i, ..., p_49, ...]`
- Then when we are doing a neighbourhood search for particles within the kernel radius of some particle `a`, we just need to iterate over particles between `startIndex` and `endIndex` for each of the 27 adjacent cells of whatever cell particle `a` is in
  - Spatial discretisation is done in cells of size equal to the kernel radius, so only immediately adjacent cells need to be checked
- The `startIndex` and `endIndex` of each bin are calculated during the partition/scan/scatter pipeline and stored in a buffer `Offsets` such that `startIndex = Offsets[hash]` and `endIndex = Offsets[hash + 1]`, where `hash` is the hash of a grid cell
  - E.g. for cell (1, 1, 1): `startIndex = Offsets[hash(1,1,1)] = 42` and `endIndex = Offsets[hash(1,1,1) + 1] = 92`
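To make the pipeline above concrete, here's the same logic as an illustrative serial Python sketch (not my HLSL — a counting sort stands in for the paper's parallel scan, and the Teschner-style hash constants are my assumption, not necessarily the paper's hash):

```python
# Illustrative sketch of the partition/scan/scatter pipeline.
# Serial Python with a counting sort standing in for the parallel scan.

def grid_pos(p, cell_size):
    # Discretise a position to its grid cell, e.g. (1, 1, 1)
    return tuple(int(c // cell_size) for c in p)

def cell_hash(cell, table_size):
    # Teschner-style spatial hash (assumed constants, for illustration only)
    x, y, z = cell
    return ((x * 73856093) ^ (y * 19349663) ^ (z * 83492791)) % table_size

def build_offsets(positions, cell_size, table_size):
    hashes = [cell_hash(grid_pos(p, cell_size), table_size) for p in positions]
    # Count particles per bin...
    counts = [0] * (table_size + 1)
    for h in hashes:
        counts[h + 1] += 1
    # ...then prefix-sum into Offsets: bin h spans [Offsets[h], Offsets[h + 1])
    offsets = counts[:]
    for h in range(table_size):
        offsets[h + 1] += offsets[h]
    # Scatter: re-order so that same-cell particles are contiguous
    cursor = offsets[:]
    ordered = [None] * len(positions)
    for i, h in enumerate(hashes):
        ordered[cursor[h]] = positions[i]
        cursor[h] += 1
    return ordered, offsets
```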
So that's the method, and it is working: the Position and Velocity buffers are ordered this way and mirror each other. Buffers storing other properties (e.g. Density) are derived from these ordered buffers, so ordering/contiguousness is preserved.
For context, here's a code snippet of an example neighbourhood search:
```hlsl
void CalculatePCISPHComponents(uint i, out float3 viscosity, out float3 surfaceTension, out float3 pressure, out float3 xsph) {
    viscosity = 0;
    surfaceTension = 0;
    pressure = 0;
    xsph = 0;
    float3 posI = Positions[i];
    int3 gridPosI = GetGridPos(posI);
    float3 velI = Velocities[i];
    for (int x = -1; x < 2; x++) {
        for (int y = -1; y < 2; y++) {
            for (int z = -1; z < 2; z++) {
                int3 gridPosJ = gridPosI + int3(x, y, z);
                if (!IsInBounds(gridPosJ)) continue;
                uint hash = CalculateHashFromGrid(gridPosJ);
                uint startIndex = Offsets[hash];
                uint endIndex = Offsets[hash + 1];
                for (uint j = startIndex; j < endIndex; j++) {
                    if (i == j) continue;
                    float3 posOffset = posI - Positions[j];
                    float rSq = dot(posOffset, posOffset);
                    if (rSq < Epsilon * Epsilon || rSq > 4 * smoothingRadius * smoothingRadius) continue;
                    float r = sqrt(rSq);
                    float3 velOffset = velI - Velocities[j];
                    float kernel = CubicSplineKernel(r);
                    float3 grad = CubicSplineGrad(posOffset, r);
                    viscosity += CalculateViscosityContribution(posOffset, velOffset, grad, i, j);
                    surfaceTension += -surfaceTensionMultiplier * kernel * (posOffset / r);
                    pressure += CalculatePressureContribution(posOffset, grad, i, j);
                    float massOverDensity = particleMass / Densities[j];
                    xsph += velocitySmoothing * massOverDensity * kernel * -velOffset;
                }
            }
        }
    }
}
```
This function is called per particle (one thread per particle) in a dispatch like this (`CalculatePCISPHAcceleration` just sums the forces spat out by the function above):
```hlsl
[numthreads(NumThreads, 1, 1)]
void CalculateNonPressureAcceleration(uint3 id : SV_DispatchThreadID) {
    if (id.x >= instanceCount) return;
    IntermediateAccelerations[id.x] = CalculatePCISPHAcceleration(id.x);
}
```
For each of the 27 adjacent-cell searches I'm iterating over a contiguous region of the relevant buffers, but I'm not seeing the results I would expect: increasing the particle count beyond ~15K tanks performance, and I've seen implementations run hundreds of thousands of particles using similar memory access patterns. I suspect I'm missing something or have a fundamental misunderstanding of GPU memory.
I've tried:
- Using a cell-based dispatch pattern which copies neighbourhood particles to shared memory before dispatching threads
- Changing hash functions (I tried Morton codes and the 3D linearisation used in the implementation linked above)
- Profiling a full frame capture, which eventually attributes 16.5% of the compute time for one of the kernels to this line of compiled code:
  `u_xlat21.xyz = u_xlat1.xyz + (-u_xlat21.xyz);`
  which corresponds to the shader line
  `float3 posOffset = posI - Positions[j];`
  This is obviously a trivial calculation, leading me to believe that the memory access pattern is indeed the bottleneck
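For clarity, by Morton hashing I mean bit-interleaving the cell coordinates so that spatially nearby cells tend to land near each other in memory; an illustrative Python version (not my HLSL) using the well-known bit-spreading trick:

```python
# Illustrative 3D Morton (Z-order) encoding of a grid cell.

def part1by2(n):
    # Spread the low 10 bits of n so there are two zero bits between each bit.
    n &= 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n

def morton3(x, y, z):
    # x occupies bits 0, 3, 6, ...; y bits 1, 4, 7, ...; z bits 2, 5, 8, ...
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)
```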
Some details:
- Running on an M4 Metal 10-core GPU
- 256 threads per thread group (i.e. `ceil(instanceCount / 256)` thread groups per kernel dispatch)
- 8192 bins
I appreciate that looking for a silver bullet here is unlikely to pay off, but I'd welcome any insights/suspicions/encouragements of things to try! I hope I've given enough context, but the codebase is here if curious.