I heard THD is the standard metric to assess audio quality in an audio system,
It's one of many measures; it describes the linearity of a system, but it doesn't tell you anything about frequency-dependent behaviour, noise figures, gain stability…
So, first recommendation here: take a step back and think about what you are doing this for! Do you really care about THD? I'm personally not sure, at least not from this sentence. Is someone asking you to measure the THD of something? Are you trying to understand whether e.g. an amplifier has been designed with enough headroom to avoid compression? Do you suspect unwanted elasticities in some acoustomechanic setup? Be sure about what you want to achieve!
FFT seemed the way to compute it,
That's only true if you can precisely produce test tones that fall into the raster of frequencies the DFT (of which the FFT is just one special kind of implementation; what is computed is the DFT, the FFT is just the "math code" to achieve a DFT). That's why I was very skeptical of your windowing: In the case that your tone falls exactly into that raster, you don't gain anything from windowing anymore. In case it doesn't, your windowing "smears" your tone a bit more nicely, but that doesn't really help much with calculating a good THD+N figure (let alone a THD).
a THD which should come out as -80dB jumps up to -60dB (a whooping 20dB of distortion completely due to maths working out poorly
Rest assure that if your audio system achieves a -60 dB THD, it has done a great job. Consumer audio typically is more like -30 dB, i.e. 1000 times as much power in unwanted harmonics. If this is your sole complaint, stop here and think about the things you want to quantify. If you don't understand what you're quantifying, nor what typical performance looks like, all you get in the end is a number and not even a speckle of understanding.
Also note that your measurements not only consist of a signal and its harmonics, but also happen in a noisy environment – from acoustic noise through microphone electric noise through quantization and a host of other contributors – so looking at the noise spectrum of a real observation without caring about the signal and its harmonics might actually be a nice idea for you to get an idea what kind of precision you can achieve at all.
The solution is either to think carefully about what test tone to use, or switch to a more advanced library which allows for arbitrary FFT window sizes and do the equivalent (in complexity, besides learning to use the new FFT library) of choosing the test tone carefully by adapting the FFT window to the arbitrary test tone
No! You need to lay off the FFTs, it's not doing you any favors here. (Also "more advanced FFT library" is… a bit of an overstatement. If you need a 11317-point DFT for some strange reason, then just sit down and do evaluate the DFT sum for each point. It's not really hard math, just a for loop, quite honestly. 4 lines of code, if you don't need it to be pretty; you however don't need that, it's a bad approach to use the DFT/FFT here alltogether.)
There's other methods of estimating the frequency and power of a periodic signal, and almost all of them fit your problem description better:
- Can't define the input frequency and hence the frequencies of its harmonics:
Strong indication that a method that gives you only values at discrete points is not really suited to the problem
- Needs to be "real-time", whatever that specifically means, but probably means that million-point DFTs are out, since you need to record a million points first, and seeing your sampling rate is not going to be in the Megahertzes, that takes a long time:
Strong indication that a method where frequency granularity is the inverse of vector length to be captured first is wrong
- Needs to work on noisy recordings:
Indication that you want a method that doesn't have the same frequency resolution "everywhere" but can be "tuned" to be narrow around the frequencies of interest.
All these points say "hm, we want a spectral estimator that can locate a dominant frequency component arbirtrarily precisely, and one estimator that can give us the power at all multiples of that frequency"; and none says "oh we need to sample the spectrum in equal distance", which is what the DFT does!
So, start simpler!
A spectral estimator to find the fundamental
A very simple software (real-valued) Phase-locked loop (PLL) can be used to "lock" a synthetic oscillation to the same frequency as the fundamental you're observing.
That's it. You first make the loop filter bandwidth large, so it quickly locks onto the strongest frequency component, and then reduce it gradually, to reduce the "wobbling" of the estimate due to noise and environmental variations. The noise variance can be reduced very far – at the cost of getting a very slow control loop that might take a looong time to lock. Realistic bandwidths for audible frequencies should allow you to get a good lock in less than 1/10 of a second.
As soon as things are locked, you use the frequency of your synthetic oscillator: that's your fundamental frequency.
An estimator for the powers of fundamental and harmonics.
Great, we now multiply the input with our synthetic oscillator. What happens there is that this allows you to shift your fundamental tone from $\pm f_{\text{fund.}}$ (remember that real-valued signals are always symmetrical in spectrum!) to
- $-f_{\text{fund.}}-f_{\text{fund.}} =-2f_{\text{fund.}}$,
- $-f_{\text{fund.}}+f_{\text{fund.}} =+f_{\text{fund.}}-f_{\text{fund.}} = 0$, and
- $+f_{\text{fund.}}+f_{\text{fund.}} = +2f_{\text{fund.}}$.
A narrow fixed low-pass filter can be used to isolate the 0-frequency component. Bang! square that and you have your fundamental energy. Again, the narrower the filter, the less noise in your estimate, but the longer it takes to stabilize.
Do the same for multiples of $f_{\text{fund.}}$, and you will get the powers for the harmonics. (And you can do these things in parallel on the same input; so assuming your computer is fast enough, it makes no real-time difference whether you only look at the first three or the first 17 harmonics. It's more of a question whether you want to consider energy at the 17. harmonic frequency a "harmonic"; usually, the physical models of what leads to that says "no, that's nonsensical".)
Now, square root of the sum of powers in harmonics divided by square root of fundamental power: THD. Done!
Things where the FFT might actually start to appear again
Filtering can be a computationally intense thing. It's usually not for so few filters, at audio sampling rates, for modern microcontrollers or application processors. It still might be a non-trivial amount of CPU; you can make that faster using fast-convolution approaches that internally use the FFT. But that is not doing the spectral analysis through FFTs, it's effectively just speeding up convolution through Fourier equivalences.
You can also use a small, fixed-size FFT (think: 32 bins or so) to get an initial guess for at what frequency to start the PLL upon initialization. But that's more like an oracle to give you better initial guessing than a detector.