Different GPUs use different tricks and techniques, so this answer is intentionally very generic, and some details may not apply to some GPUs, past, present (2017), or future.
This mostly applies to personal computer GPUs. Mobile GPUs (phones, tablets) try to be smarter and less wasteful, and to save memory bandwidth, but they still follow a similar process. Covering all GPUs in detail would make for a very long answer.
The rendering stages below are intentionally not in pipeline order.
The parts covering parallelism are in bold.
The stages can run in a pipeline where the next draw call starts being processed before the previous finishes, and while other GPU cores are drawing.
Drawing Pixels
Modern GPUs no longer draw one triangle and then the next. Things are much more complicated in order to maximize parallelism:
GPUs split the drawing into WxH tiles (this depends on the GPU but let's go with 8x8 for our example).
Each tile can be drawn in parallel by the GPU regardless of the drawing mode.
GPU cores are (usually) arranged in groups of WxH (eg: 8x8 = 64 cores) matching the tile size; one group can render one tile while another group renders a different tile.
The entire group executes the same shader at the same time, rendering all the tile's pixels simultaneously. If some pixels are not to be drawn (ie: the triangle covers only part of the tile), those cores still execute, but in a disabled mode that ignores the result and writes nothing. Mobile GPUs often work with smaller core groups to be less wasteful, but still render multiple pixels of the tile at the same time.
On some GPUs two or more non-overlapping triangles using the same shader rendered in the same draw call covering the same tile will be combined and rendered at the same time in that tile.
Other GPUs will have to make one pass for each triangle.
This means that, in the worst case, a triangle 1 pixel in size gives you 1 core in the group drawing something useful while the other 63 cores in the tile sit in "ignore-result" mode doing "nothing".
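The lockstep execution above can be sketched in a few lines. This is a toy illustration, not how any real GPU is implemented: all 64 "cores" of the group run the pixel shader, and a coverage mask decides which results are actually kept.

```python
# Toy model of an 8x8 core group (NOT a real GPU implementation).
# Every lane executes the shader; a coverage mask gates the writes.

TILE_W, TILE_H = 8, 8

def shade(x, y):
    """Stand-in pixel shader: returns a fake color value."""
    return (x * 16, y * 16, 128)

def draw_tile(coverage):
    """coverage[y][x] is True where the triangle covers the pixel.
    All 64 'cores' run the shader; uncovered lanes discard the result."""
    framebuffer = {}
    useful = 0
    for y in range(TILE_H):
        for x in range(TILE_W):
            color = shade(x, y)          # every lane executes the shader
            if coverage[y][x]:           # ...but only covered lanes write
                framebuffer[(x, y)] = color
                useful += 1
    return framebuffer, useful

# Worst case from the text: a triangle covering a single pixel of the tile.
one_pixel = [[(x, y) == (3, 3) for x in range(TILE_W)] for y in range(TILE_H)]
fb, useful = draw_tile(one_pixel)
print(useful, TILE_W * TILE_H - useful)  # 1 useful lane, 63 wasted
```

Note that all 64 shader invocations "run" either way; only the write is masked, which is exactly why the 1-pixel triangle is the worst case.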
Transforming Vertices
The same core groups (on a unified-shader GPU) also process incoming vertices (the vertex shader).
Incoming vertex batches are processed N vertices at a time (eg: 64, or a multiple of 64, with our 8x8 core arrangement above) and dumped into a temporary buffer of transformed vertices.
The GPU can start transforming vertices of the next draw call before the first is done transforming so vertex work on different draw calls can be done in parallel.
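The batching described above can be sketched as follows. The batch size and the scale-by-2 "shader" are made-up placeholders; the point is only that vertices flow through in core-group-sized chunks into a temporary buffer.

```python
# Toy sketch (hypothetical numbers): vertices are transformed in batches
# that match the core-group size, then written to a temporary buffer.

BATCH = 64  # one 8x8 core group transforms 64 vertices at a time

def vertex_shader(v):
    """Stand-in transform: scale by 2 (a real shader applies matrices)."""
    return tuple(c * 2 for c in v)

def transform(vertices):
    transformed = []  # temporary buffer of transformed vertices
    for start in range(0, len(vertices), BATCH):
        batch = vertices[start:start + BATCH]
        # In hardware all 64 lanes run in parallel; here we just loop.
        transformed.extend(vertex_shader(v) for v in batch)
    return transformed

verts = [(i, i, 0) for i in range(100)]  # 100 vertices -> 2 batches (64 + 36)
out = transform(verts)
```

Because each batch is independent, batches from the same draw call (or from the next draw call) can be in flight on different core groups at once.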
Clipping & Geometry Shaders
The temporary outputs from the vertex transform are culled and clipped into more triangles, optionally running a geometry shader before clipping, and written to yet another temporary buffer.
Each batch of output from the vertex transform can be processed in parallel.
GPU drivers might combine vertex, geometry, and/or clipping stages together into one shader-program execution stage internally.
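To see why clipping can produce "more triangles", here is a toy Sutherland-Hodgman-style clip against a single half-plane (x >= 0). A real GPU clips against the view frustum in homogeneous coordinates; this is only to show how a clipped triangle can come out with extra vertices.

```python
# Toy half-plane clip (x >= 0). Real GPUs clip against the full frustum
# in homogeneous clip space; this shows only the extra-vertex effect.

def clip_x_ge_0(poly):
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        cur_in, prev_in = cur[0] >= 0, prev[0] >= 0
        if cur_in != prev_in:
            # Edge crosses the plane: emit the intersection point.
            t = prev[0] / (prev[0] - cur[0])
            out.append((0.0, prev[1] + t * (cur[1] - prev[1])))
        if cur_in:
            out.append(cur)
    return out

tri = [(-1.0, 0.0), (1.0, 1.0), (1.0, -1.0)]  # one vertex behind the plane
clipped = clip_x_ge_0(tri)
# The triangle became a quad: 4 vertices -> 2 triangles after fanning.
```

One clipped corner turned 3 vertices into 4, i.e. one input triangle becomes two output triangles, which is why the result goes into yet another temporary buffer rather than back in place.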
Tile Triangle Buckets
The transformed and clipped triangles are put into "buckets" (lists, arrays, circular buffers, or other structures), one for every tile they cover. There is one bucket per render tile, or possibly more (eg: one per shader in a queue; it depends on the GPU and driver).
When a tile is ready to draw, a GPU core group "grabs" it from the queue and starts drawing that tile; the next GPU core group grabs the next tile, and so on...
There are essentially two approaches:
- As triangles come in, they're shoved into the tile buckets and GPU core groups "grab" the tile as they come and draw the triangles in the list.
- All the triangles are sorted into tile buckets first and then the GPU cores start to draw.
Which approach is used often depends on whether it is a mobile (phone/tablet) GPU or a desktop/laptop GPU, but not necessarily.
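The bucketing step can be sketched like this. The tile size, screen size, and bounding-box test are illustrative choices; real GPUs use exact tile/triangle overlap tests, but a bounding box is the simplest conservative version.

```python
# Toy binning sketch: each triangle goes into the bucket of every
# 8x8 tile its bounding box touches. Real GPUs use exact overlap
# tests; the bounding box here is just conservative and simple.

TILE = 8

def bin_triangles(triangles, screen_w, screen_h):
    cols = (screen_w + TILE - 1) // TILE
    rows = (screen_h + TILE - 1) // TILE
    buckets = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Append the triangle to every tile its bounding box overlaps.
        for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                buckets[(tx, ty)].append(tri)
    return buckets

# One small triangle inside tile (0, 0); one spanning three tiles wide.
tris = [[(1, 1), (5, 1), (3, 5)], [(2, 2), (20, 2), (10, 6)]]
buckets = bin_triangles(tris, 32, 32)
```

With the buckets built, the two approaches above differ only in when core groups start pulling from them: immediately as triangles arrive, or after all triangles are binned.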
On some GPUs this bucketing work is done by one or more master-coordinator processors inside the GPU, which coordinate all the "dumber" (but better at drawing pixels) GPU core groups and tell them what to do and when.
On other GPUs the cores are more generic and capable enough to do this themselves or at least part of this coordination work.
On some machines this coordination work is done by the main computer CPU itself.
And on yet other (older or integrated GPU) machines the vertex, geometry, and clipping work is done by the CPU itself and the GPU cores only do pixel-shader work.
All the work is cut into small workload batches and into as many stages as possible (and practical) to maximize parallelism, while striking a balance between parallelism and coordination overhead.