std::fill(std::begin(histogram_output), std::end(histogram_output), std::size_t{ 0 });
A simpler solution is to aggregate-initialize the array (with empty braces), which zeroes every element, instead of default-initializing it:
std::array<std::size_t, /* skipped to save space */> histogram_output{};
for (auto value : image_data) { ++histogram_output[value]; }
This is the simplest histogram algorithm that isn't bad, so it has a good "performance to complexity" ratio. There are techniques to do it faster, but they introduce more complexity, so whether that's worth it is a trade-off you'd have to decide. Turbo-Histogram has collected many of those techniques.
A significant downside of the simple algorithm is as follows. If the same entry of the histogram is incremented again shortly after it was last incremented (which happens rarely for random data, but often for various kinds of structured data, including non-random images), then the second increment has a dependency on the first increment through the associated store and load. How bad that is depends on microarchitectural details of the CPU you're running the code on, but it generally isn't great and tends to result in significantly lower performance than independent increments.
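To see where that dependency comes from: a single ++histogram_output[v] is, roughly (a sketch, not the exact code any particular compiler generates):
auto count = histogram_output[v]; // load
count = count + 1;                // add
histogram_output[v] = count;      // store
// If the next element has the same value v, its load cannot complete until
// this store has, so the two increments execute serially rather than in parallel.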
The algorithms available in Turbo-Histogram use a couple of different techniques:
- Build multiple independent histograms, and sum them in the end. This gives the CPU some independent work to do even if the same value occurs multiple times in a short span.
- Detect short runs of equal values and count them all in one go (in clever ways that make detection cheap, but that keep runs somewhat short, e.g. up to 8 bytes); see the sketch after this list. This avoids the worst case of the simple algorithm and turns it into the best case: runs of equal values can be counted much faster than one at a time. But if runs of equal values are not common enough, trying to find them is not necessarily a win.
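Here is a minimal sketch of the run-detection idea for the byte case (this is not Turbo-Histogram's actual code, and histogram_runs and its parameters are hypothetical names): load 8 bytes at once and compare them against the first byte broadcast to all 8 positions.
#include <cstddef>
#include <cstdint>
#include <cstring>

void histogram_runs(const std::uint8_t* data, std::size_t n, std::size_t* hist) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t block;
        std::memcpy(&block, data + i, 8); // one 64-bit load instead of 8 byte loads
        // Broadcast the first byte of the block to all 8 byte positions.
        std::uint64_t broadcast = std::uint64_t{ data[i] } * 0x0101010101010101ull;
        if (block == broadcast) {
            hist[data[i]] += 8; // the whole block is one run: count it in one go
        } else {
            for (int j = 0; j < 8; ++j) { ++hist[data[i + j]]; }
        }
    }
    for (; i < n; ++i) { ++hist[data[i]]; } // scalar tail
}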
On top of that there are various optimizations, such as replacing 16 individual byte loads with one 128-bit load (_mm_loadu_si128) followed by 16 extract-byte operations (_mm_extract_epi8). Which variant is best depends a lot on the specific CPU microarchitecture: the costs of the various operations on that microarchitecture, and its specific quirks. It's different between Intel Haswell and Intel Rocket Lake, and so on.
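For instance, a 16-bytes-at-a-time variant using those two intrinsics could look like this (a sketch assuming SSE4.1 is available; histogram_sse and its parameters are hypothetical names):
#include <smmintrin.h> // SSE4.1 for _mm_extract_epi8
#include <cstddef>
#include <cstdint>

void histogram_sse(const std::uint8_t* data, std::size_t n, std::size_t* hist) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // One 128-bit load replaces 16 scalar byte loads.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // _mm_extract_epi8 requires a compile-time lane index, hence the unrolled form.
        ++hist[_mm_extract_epi8(v, 0)];  ++hist[_mm_extract_epi8(v, 1)];
        ++hist[_mm_extract_epi8(v, 2)];  ++hist[_mm_extract_epi8(v, 3)];
        ++hist[_mm_extract_epi8(v, 4)];  ++hist[_mm_extract_epi8(v, 5)];
        ++hist[_mm_extract_epi8(v, 6)];  ++hist[_mm_extract_epi8(v, 7)];
        ++hist[_mm_extract_epi8(v, 8)];  ++hist[_mm_extract_epi8(v, 9)];
        ++hist[_mm_extract_epi8(v, 10)]; ++hist[_mm_extract_epi8(v, 11)];
        ++hist[_mm_extract_epi8(v, 12)]; ++hist[_mm_extract_epi8(v, 13)];
        ++hist[_mm_extract_epi8(v, 14)]; ++hist[_mm_extract_epi8(v, 15)];
    }
    for (; i < n; ++i) { ++hist[data[i]]; } // scalar tail
}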
Of these, the only technique that can easily be agnostic of ElementT is building multiple independent histograms and summing them in the end; a sketch of that follows. The other techniques could be used by having a specialization for the (presumably most common) case that ElementT = std::uint8_t.
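For example (a minimal sketch with four sub-histograms; histogram_multi, NumBuckets, and the choice of four are assumptions, not anything from the original code):
#include <array>
#include <cstddef>
#include <vector>

template <typename ElementT, std::size_t NumBuckets>
std::array<std::size_t, NumBuckets> histogram_multi(const std::vector<ElementT>& data) {
    // Four independent sub-histograms: consecutive equal values update
    // different arrays, so their increments do not depend on each other.
    std::array<std::array<std::size_t, NumBuckets>, 4> parts{};
    std::size_t i = 0;
    for (; i + 4 <= data.size(); i += 4) {
        ++parts[0][data[i]];
        ++parts[1][data[i + 1]];
        ++parts[2][data[i + 2]];
        ++parts[3][data[i + 3]];
    }
    for (; i < data.size(); ++i) { ++parts[0][data[i]]; } // tail
    // Sum the sub-histograms in the end.
    std::array<std::size_t, NumBuckets> result{};
    for (std::size_t b = 0; b < NumBuckets; ++b) {
        result[b] = parts[0][b] + parts[1][b] + parts[2][b] + parts[3][b];
    }
    return result;
}
For ElementT = std::uint8_t, NumBuckets would be 256.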
ElementT had better be an unsigned integral type and not too large, otherwise the whole thing breaks down: a signed or non-integral type can produce out-of-range indices, and a wide type would make the table impractically large. Perhaps that ought to be enforced.
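For example, with static_asserts (a minimal sketch, assuming C++17 and taking 16 bits as the size cutoff):
// needs #include <type_traits>
static_assert(std::is_unsigned_v<ElementT>,
    "ElementT must be an unsigned integral type, otherwise element values can index out of range");
static_assert(sizeof(ElementT) <= 2,
    "a wider ElementT would need a histogram with 2^32 or more entries");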