Different GPUs use different tricks and techniques, so this answer is intentionally very generic, and some details may not apply to some GPUs, past, present (2017), or future.
This mostly applies to personal computer GPUs. Mobile GPUs (phones, tablets) try to be smarter and less wasteful, and to save memory bandwidth, but they still follow a similar process. Covering all GPUs in detail would make for a very long answer.
The rendering stages below are intentionally not in pipeline order.
The parts covering parallelism are in bold.
The stages can run in a pipeline where the next draw call starts being processed before the previous finishes, and while other GPU cores are drawing.
Drawing Pixels
Modern GPUs no longer draw one triangle and then the next. Things are much more complicated in order to maximize parallelism:
GPUs split the drawing into WxH tiles (this depends on the GPU but let's go with 8x8 for our example).
Each tile can be drawn in parallel by the GPU regardless of the drawing mode.
GPU cores are (usually) arranged in groups of WxH (eg: 8x8 = 64 cores) matching the tile size; one group can render one tile while another group renders a different tile.
The entire group executes the same shader at the same time, rendering all the tile's pixels simultaneously. If some pixels are not to be drawn (ie: the triangle covers only part of the tile), those cores still execute, but in a disabled mode that ignores the result and writes nothing. Mobile GPUs often work with smaller core groups to be less wasteful, but still render multiple pixels of the tile at the same time.
On some GPUs two or more non-overlapping triangles using the same shader rendered in the same draw call covering the same tile will be combined and rendered at the same time in that tile.
Other GPUs will have to make one pass for each triangle.
This means that, in the worst case, a triangle 1 pixel in size gives you 1 core in the group drawing something useful while the other 63 cores in the tile sit in "ignore-result" mode doing "nothing".
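The lockstep execution above can be sketched in a few lines. This is a toy illustration, not how any real GPU is implemented: all 64 "cores" of the group run the pixel shader, and a coverage mask decides which results are actually kept.

```python
# Toy model of an 8x8 core group (NOT a real GPU implementation).
# Every lane executes the shader; a coverage mask gates the writes.

TILE_W, TILE_H = 8, 8

def shade(x, y):
    """Stand-in pixel shader: returns a fake color value."""
    return (x * 16, y * 16, 128)

def draw_tile(coverage):
    """coverage[y][x] is True where the triangle covers the pixel.
    All 64 'cores' run the shader; uncovered lanes discard the result."""
    framebuffer = {}
    useful = 0
    for y in range(TILE_H):
        for x in range(TILE_W):
            color = shade(x, y)          # every lane executes the shader
            if coverage[y][x]:           # ...but only covered lanes write
                framebuffer[(x, y)] = color
                useful += 1
    return framebuffer, useful

# Worst case from the text: a triangle covering a single pixel of the tile.
one_pixel = [[(x, y) == (3, 3) for x in range(TILE_W)] for y in range(TILE_H)]
fb, useful = draw_tile(one_pixel)
print(useful, TILE_W * TILE_H - useful)  # 1 useful lane, 63 wasted
```

Note that all 64 shader invocations "run" either way; only the write is masked, which is exactly why the 1-pixel triangle is the worst case.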
Transforming Vertices
The same core groups (on a unified-shader GPU) also process incoming vertices (the vertex shader).
Incoming vertex batches are processed N vertices at a time (eg: 64, or a multiple of 64, with our 8x8 core arrangement above) and dumped into a temporary buffer of transformed vertices.
The GPU can start transforming vertices of the next draw call before the first is done transforming so vertex work on different draw calls can be done in parallel.
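The batching described above can be sketched as follows. The batch size and the scale-by-2 "shader" are made-up placeholders; the point is only that vertices flow through in core-group-sized chunks into a temporary buffer.

```python
# Toy sketch (hypothetical numbers): vertices are transformed in batches
# that match the core-group size, then written to a temporary buffer.

BATCH = 64  # one 8x8 core group transforms 64 vertices at a time

def vertex_shader(v):
    """Stand-in transform: scale by 2 (a real shader applies matrices)."""
    return tuple(c * 2 for c in v)

def transform(vertices):
    transformed = []  # temporary buffer of transformed vertices
    for start in range(0, len(vertices), BATCH):
        batch = vertices[start:start + BATCH]
        # In hardware all 64 lanes run in parallel; here we just loop.
        transformed.extend(vertex_shader(v) for v in batch)
    return transformed

verts = [(i, i, 0) for i in range(100)]  # 100 vertices -> 2 batches (64 + 36)
out = transform(verts)
```

Because each batch is independent, batches from the same draw call (or from the next draw call) can be in flight on different core groups at once.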
Clipping & Geometry Shaders
The temporary outputs from the vertex transform are culled and clipped into more triangles, optionally running a geometry shader before clipping, and written to yet another temporary buffer.
Each batch of output from the vertex transform can be processed in parallel.
GPU drivers might combine vertex, geometry, and/or clipping stages together into one shader-program execution stage internally.
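To see why clipping can produce "more triangles", here is a toy Sutherland-Hodgman-style clip against a single half-plane (x >= 0). A real GPU clips against the view frustum in homogeneous coordinates; this is only to show how a clipped triangle can come out with extra vertices.

```python
# Toy half-plane clip (x >= 0). Real GPUs clip against the full frustum
# in homogeneous clip space; this shows only the extra-vertex effect.

def clip_x_ge_0(poly):
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        cur_in, prev_in = cur[0] >= 0, prev[0] >= 0
        if cur_in != prev_in:
            # Edge crosses the plane: emit the intersection point.
            t = prev[0] / (prev[0] - cur[0])
            out.append((0.0, prev[1] + t * (cur[1] - prev[1])))
        if cur_in:
            out.append(cur)
    return out

tri = [(-1.0, 0.0), (1.0, 1.0), (1.0, -1.0)]  # one vertex behind the plane
clipped = clip_x_ge_0(tri)
# The triangle became a quad: 4 vertices -> 2 triangles after fanning.
```

One clipped corner turned 3 vertices into 4, i.e. one input triangle becomes two output triangles, which is why the result goes into yet another temporary buffer rather than back in place.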
Tile Triangle Buckets
The transformed and clipped triangles are put into "buckets" (lists, arrays, circular buffers, or other structures), one for every tile they cover. There is one bucket per render tile, or possibly more (eg: one per shader in a queue; it depends on the GPU and driver).
When a tile is ready to draw, a GPU core group "grabs" it from the queue and starts drawing that tile; the next GPU core group grabs the next tile, and so on...
There are essentially two approaches:
- As triangles come in, they're shoved into the tile buckets and GPU core groups "grab" the tile as they come and draw the triangles in the list.
- All the triangles are sorted into tile buckets first and then the GPU cores start to draw.
Which approach is used often depends on whether it is a mobile (phone/tablet) GPU or a desktop/laptop GPU, but not necessarily.
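The bucketing step can be sketched like this. The tile size, screen size, and bounding-box test are illustrative choices; real GPUs use exact tile/triangle overlap tests, but a bounding box is the simplest conservative version.

```python
# Toy binning sketch: each triangle goes into the bucket of every
# 8x8 tile its bounding box touches. Real GPUs use exact overlap
# tests; the bounding box here is just conservative and simple.

TILE = 8

def bin_triangles(triangles, screen_w, screen_h):
    cols = (screen_w + TILE - 1) // TILE
    rows = (screen_h + TILE - 1) // TILE
    buckets = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for tri in triangles:
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Append the triangle to every tile its bounding box overlaps.
        for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1):
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1):
                buckets[(tx, ty)].append(tri)
    return buckets

# One small triangle inside tile (0, 0); one spanning three tiles wide.
tris = [[(1, 1), (5, 1), (3, 5)], [(2, 2), (20, 2), (10, 6)]]
buckets = bin_triangles(tris, 32, 32)
```

With the buckets built, the two approaches above differ only in when core groups start pulling from them: immediately as triangles arrive, or after all triangles are binned.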
On some GPUs this bucketing work is done by one or more master-coordinator processors inside the GPU, which coordinate all the "dumber" (but better at drawing pixels) GPU core groups and tell them what to do and when.
On other GPUs the cores are more generic and capable enough to do this themselves or at least part of this coordination work.
On some machines this coordination work is done by the main computer CPU itself.
And on yet other (older or integrated GPU) machines the vertex, geometry, and clipping work is done by the CPU itself and the GPU cores only do pixel-shader work.
All the work is cut into small workload batches and into as many stages as possible (and practical) to maximize parallelism, while striking a balance between parallelism and coordination overhead.