Must the cores of a multi-core CPU all share one L3 cache? Is it possible for a CPU to have several L3 caches? For example, suppose a CPU has 24 cores and each group of three cores shares an L3 cache, so there are 8 L3 caches.
1 Answer
AMD's Zen family does this, with each "core complex" (CCX) of 4 or 8 cores sharing an L3, but no whole-chip shared cache outside that. AMD's Infinity Fabric connects the CCXs to each other and to memory controllers and I/O, with many-core CPUs built out of multiple modules of CCXs + memory controllers + I/O.
This is a lot like traditional multi-socket systems where each socket had a chip with one shared L3 for all its cores, and links to other sockets with snoop filters to keep bandwidth down to manageable levels (and keep latency low within one socket / CCX). There are NUMA-style inter-core latency differences for pairs of cores within the same CCX vs. in different CCXs.
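One practical consequence of those latency differences is thread placement: threads that communicate heavily are usually better off on cores that share an L3. The following is a minimal sketch of that idea, not a tuned scheduler. The helper names `pick_same_l3` and `pin_current_process` and the `EXAMPLE_GROUPS` layout are hypothetical; only `os.sched_setaffinity` is a real (Linux-only) standard-library call.

```python
import os


def pick_same_l3(groups, n_threads):
    """Return n_threads CPUs taken from the first L3 group big enough, else None.

    `groups` is an iterable of CPU-number collections, one per L3 domain.
    """
    for cpus in groups:
        if len(cpus) >= n_threads:
            return set(sorted(cpus)[:n_threads])
    return None  # no single L3 domain can hold that many threads


def pin_current_process(cpus):
    """Restrict the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"


# Hypothetical layout: two 4-core CCXs. Real CPU numbering varies by machine,
# and with SMT enabled each core shows up as more than one logical CPU.
EXAMPLE_GROUPS = [(0, 1, 2, 3), (4, 5, 6, 7)]
```

With this layout, `pick_same_l3(EXAMPLE_GROUPS, 3)` picks three CPUs from the first CCX, and passing the result to `pin_current_process` would keep the process there. A request for 5 CPUs returns `None`, because no single CCX in the example is large enough.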
The low-end models only have one CCX, which is up to 4 cores in Zen 1 & 2, or up to 8 cores in Zen 3 and 4. The amount of L3 cache per CCX can vary by model within one generation.
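To see how a particular machine groups its cores, Linux exposes the cache topology under `/sys/devices/system/cpu/cpuN/cache/indexM/` (files `level`, `shared_cpu_list`, `size`). The following is a rough sketch that reads those files; the function names `parse_cpu_list` and `l3_domains` are made up for illustration, and the sysfs layout is assumed to be the usual Linux one. On other operating systems it simply finds nothing.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(sysfs_root="/sys/devices/system/cpu"):
    """Map each distinct set of logical CPUs sharing an L3 to that L3's size string."""
    domains = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index[0-9]*")
    for cache_dir in glob.glob(pattern):
        try:
            with open(os.path.join(cache_dir, "level")) as f:
                if int(f.read()) != 3:
                    continue  # only level-3 caches are of interest here
            with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
                cpus = tuple(parse_cpu_list(f.read()))
        except (OSError, ValueError):
            continue  # entry missing or malformed; skip it
        try:
            with open(os.path.join(cache_dir, "size")) as f:
                size = f.read().strip()
        except OSError:
            size = "unknown"
        domains[cpus] = size
    return domains


if __name__ == "__main__":
    for cpus, size in sorted(l3_domains().items()):
        print(f"L3 ({size}) shared by logical CPUs {list(cpus)}")
```

On a chip with one L3 per CCX this should print one line per CCX; on a chip with a single chip-wide L3 it should print one line listing every logical CPU.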
For more details see:
https://en.wikichip.org/wiki/amd/microarchitectures/zen#CPU_Complex_.28CCX.29
https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core_Complex
https://en.wikichip.org/wiki/amd/microarchitectures/zen_3#Key_changes_from_Zen_2
https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memory-subsystem-and-conclusion/
Intel has also done this, in a much worse way, for Core 2 Quad, by basically sticking two Core 2 Duo dies in one package, with the interconnect between them being the FSB (front-side bus), which was about as slow as going to DRAM. (Last-level cache in those days was L2, so it was two separate L2 caches.) See the "Final Words (Dunnington)" section in chips&cheese's historical look back at Dunnington for some description of how things worked in Core 2 Quads, which lacked Dunnington's uncore / shared L3: the other die literally just snooped the shared FSB and responded instead of DRAM if it had a copy of the line.
Some modern chips have groups of 2 to 4 cores sharing a medium-sized L2, but with multiple groups on the same processor all backed by a large shared L3. For example Intel's E-cores in Alder Lake do this.
AMD's Bulldozer family did even tighter coupling of a pair of weak integer cores sharing a front-end, L1i cache, and the SIMD/FP unit (calling it CMT as an alternative to SMT), but with separate per-core write-through L1d caches and a shared L2. See https://www.realworldtech.com/bulldozer/2/. There was a single L3 shared across the whole chip, though. Bulldozer was overall not very high performance for a lot of reasons.
ARM Cortex-A510 cores can be clustered in a similar way, sharing an FPU, L2 cache, and L2 TLB (chips&cheese discusses the tradeoffs for that in-order efficiency core). But again, there's normally a shared L3 as a backstop outside this.
Apple's A14 has 8 MiB of L2 cache shared between the two Firestorm big cores. But there's also a slower shared L3 last-level cache for them + the Icestorm E-cores and the GPU etc.