What is the best way to assess audio quality in an embedded or Android system?

Question

In recent posts, Evaluating THD for any Test Tone and any Sample Rate, I have attempted to assess the audio quality of my Android device by using Oboe in Native C++, and evaluate THD in real time using FFT (KissFFT, easy to use). For some reason my code crashes when using an FFT window of size other than a power of 2, it might be because of a bug in my code I didn't catch, but I was working the whole time with the assumption that for optimization reasons only such windows were usable with KissFFT, this remark is important to motivate why I was so frustrated by this restriction and looked for general solutions.

My experience is limited, so I did whatever made sense: I heard THD is the standard metric to assess audio quality in an audio system, FFT seemed the way to compute it, I do not understand how the maths is derived but I understand the resulting formulas and tried to apply them.

I quickly ran into trouble however, as I realized that the frequency of the test tone had to match the FFT window size and sample rate, otherwise mathematical artifacts start appearing, and a THD which should come out as -80dB jumps up to -60dB (a whooping 20dB of distortion completely due to maths working out poorly). The solution is either to think carefully about what test tone to use, or switch to arbitrary FFT window sizes (not just 2^n) and do the equivalent (in complexity) of choosing the test tone carefully by adapting the FFT window to the arbitrary test tone. I asked about a third approach: adding more maths to remove the artifacts, but I was told that it wasn't trivial (Dirichlet kernel) and I assume it's probably very computationally inefficient.

It was hinted to me however, that there are possibly better approaches to assess audio quality in a system using C++ than using FFT to compute THD. I wondered if someone would be so kind to let me know what the better approach is, as I just dove head first in what seemed like standard/textbook practice, but I suppose it was the undergraduate/naive approach to the problem.

Are you interested in assessing quality of your Android device's input (mic, preamp, ADC) or output (DAC, amp, speaker) or both? — Mark
– Mark, Commented Oct 25 at 14:10
Hi @Mark, I'd say both, but my current focus is the input side: I need to make sure that audio doesn't get clipped, that deep in the hardware layers audio doesn't get dropped (incomplete inputs), or other issues arise, and ideally I'd need that to be detected automatically and only have a human look at it when my program raises a red flag. — Arno
– Arno, Commented Oct 25 at 19:53
Focusing on the input side first, is it fair to say that your system model is: External audio source coupled through the Android device microphone to a software process that delivers frames of audio to your application? Your objectives are: (1) Detect clipped audio, and (2) detect dropped frames? Can you tell us more about the external source and the acoustic environment the system operates (indoor quiet environment, outdoor noisy environment, etc.)? — Mark
– Mark, Commented Oct 25 at 22:53
Hi @Mark, thank you for your feedback, it's true that I realize I took a bit for granted "audio quality in C++" was enough for everyone to infer what the context is, that's a mistake on my part. Unfortunately I am not an Android System engineer, so I'll do my best to fill the blanks. The idea is, that I get the audio in the Android Native C++ space (using Oboe, which uses AAudio) and check that the audio is "clean". I guess the audio setup is microphones as well as small-jack aux inputs (preferred), running at 16kHz sample rate, I think 16bit, managed by the Audio HAL layer. — Arno
– Arno, Commented Oct 26 at 12:32
Overall, I'm trying to check that whatever audio is delivered to an app developer is "clean", and trying to detect at that level if anything went wrong to then dig deeper into things I am not familiar with eg. how the Audio HAL itself works, how the actual physical hardware behaves, what bus/connector is used to pass data. Checking that there are no gaps in data (can happen if for example audio HAL thread runs with low priority by mistake), and no clipping or other artifacts. — Arno
– Arno, Commented Oct 26 at 12:35

Marcus Müller · Accepted Answer · 2025-10-25 13:44:50Z

I heard THD is the standard metric to assess audio quality in an audio system,

It's one of many measures; it describes the linearity of a system, but it doesn't tell you anything about frequency-dependent behaviour, noise figures, gain stability…

So, first recommendation here: take a step back and think about what you are doing this for! Do you really care about THD? I'm personally not sure, at least not from this sentence. Is someone asking you to measure the THD of something? Are you trying to understand whether e.g. an amplifier has been designed with enough headroom to avoid compression? Do you suspect unwanted elasticities in some acoustomechanic setup? Be sure about what you want to achieve!

FFT seemed the way to compute it,

That's only true if you can precisely produce test tones that fall into the raster of frequencies the DFT (of which the FFT is just one special kind of implementation; what is computed is the DFT, the FFT is just the "math code" to achieve a DFT). That's why I was very skeptical of your windowing: In the case that your tone falls exactly into that raster, you don't gain anything from windowing anymore. In case it doesn't, your windowing "smears" your tone a bit more nicely, but that doesn't really help much with calculating a good THD+N figure (let alone a THD).

a THD which should come out as -80dB jumps up to -60dB (a whooping 20dB of distortion completely due to maths working out poorly

Rest assure that if your audio system achieves a -60 dB THD, it has done a great job. Consumer audio typically is more like -30 dB, i.e. 1000 times as much power in unwanted harmonics. If this is your sole complaint, stop here and think about the things you want to quantify. If you don't understand what you're quantifying, nor what typical performance looks like, all you get in the end is a number and not even a speckle of understanding.

Also note that your measurements not only consist of a signal and its harmonics, but also happen in a noisy environment – from acoustic noise through microphone electric noise through quantization and a host of other contributors – so looking at the noise spectrum of a real observation without caring about the signal and its harmonics might actually be a nice idea for you to get an idea what kind of precision you can achieve at all.

The solution is either to think carefully about what test tone to use, or switch to a more advanced library which allows for arbitrary FFT window sizes and do the equivalent (in complexity, besides learning to use the new FFT library) of choosing the test tone carefully by adapting the FFT window to the arbitrary test tone

No! You need to lay off the FFTs, it's not doing you any favors here. (Also "more advanced FFT library" is… a bit of an overstatement. If you need a 11317-point DFT for some strange reason, then just sit down and do evaluate the DFT sum for each point. It's not really hard math, just a for loop, quite honestly. 4 lines of code, if you don't need it to be pretty; you however don't need that, it's a bad approach to use the DFT/FFT here alltogether.)

There's other methods of estimating the frequency and power of a periodic signal, and almost all of them fit your problem description better:

Can't define the input frequency and hence the frequencies of its harmonics:
Strong indication that a method that gives you only values at discrete points is not really suited to the problem
Needs to be "real-time", whatever that specifically means, but probably means that million-point DFTs are out, since you need to record a million points first, and seeing your sampling rate is not going to be in the Megahertzes, that takes a long time:
Strong indication that a method where frequency granularity is the inverse of vector length to be captured first is wrong
Needs to work on noisy recordings:
Indication that you want a method that doesn't have the same frequency resolution "everywhere" but can be "tuned" to be narrow around the frequencies of interest.

All these points say "hm, we want a spectral estimator that can locate a dominant frequency component arbirtrarily precisely, and one estimator that can give us the power at all multiples of that frequency"; and none says "oh we need to sample the spectrum in equal distance", which is what the DFT does!

So, start simpler!

A spectral estimator to find the fundamental

A very simple software (real-valued) Phase-locked loop (PLL) can be used to "lock" a synthetic oscillation to the same frequency as the fundamental you're observing.

That's it. You first make the loop filter bandwidth large, so it quickly locks onto the strongest frequency component, and then reduce it gradually, to reduce the "wobbling" of the estimate due to noise and environmental variations. The noise variance can be reduced very far – at the cost of getting a very slow control loop that might take a looong time to lock. Realistic bandwidths for audible frequencies should allow you to get a good lock in less than 1/10 of a second.

As soon as things are locked, you use the frequency of your synthetic oscillator: that's your fundamental frequency.

An estimator for the powers of fundamental and harmonics.

Great, we now multiply the input with our synthetic oscillator. What happens there is that this allows you to shift your fundamental tone from $\pm f_{\text{fund.}}$ (remember that real-valued signals are always symmetrical in spectrum!) to

$-f_{\text{fund.}}-f_{\text{fund.}} =-2f_{\text{fund.}}$,
$-f_{\text{fund.}}+f_{\text{fund.}} =+f_{\text{fund.}}-f_{\text{fund.}} = 0$, and
$+f_{\text{fund.}}+f_{\text{fund.}} = +2f_{\text{fund.}}$.

A narrow fixed low-pass filter can be used to isolate the 0-frequency component. Bang! square that and you have your fundamental energy. Again, the narrower the filter, the less noise in your estimate, but the longer it takes to stabilize.

Do the same for multiples of $f_{\text{fund.}}$, and you will get the powers for the harmonics. (And you can do these things in parallel on the same input; so assuming your computer is fast enough, it makes no real-time difference whether you only look at the first three or the first 17 harmonics. It's more of a question whether you want to consider energy at the 17. harmonic frequency a "harmonic"; usually, the physical models of what leads to that says "no, that's nonsensical".)

Now, square root of the sum of powers in harmonics divided by square root of fundamental power: THD. Done!

Things where the FFT might actually start to appear again

Filtering can be a computationally intense thing. It's usually not for so few filters, at audio sampling rates, for modern microcontrollers or application processors. It still might be a non-trivial amount of CPU; you can make that faster using fast-convolution approaches that internally use the FFT. But that is not doing the spectral analysis through FFTs, it's effectively just speeding up convolution through Fourier equivalences.

You can also use a small, fixed-size FFT (think: 32 bins or so) to get an initial guess for at what frequency to start the PLL upon initialization. But that's more like an oracle to give you better initial guessing than a detector.

Thanks for suggesting the PLL approach @Marcus Müller, I wasn't aware of it. I've asked genAI to write me a PLL, but the results have been very bad. It seems also that PLLs are a kind of PID controller, another concept I'm vaguely familiar with, which have issues of stability associated with them. Would you been kind enough to show an implementation sketch in Python I (and readers) could start playing with ? I hope it's not too much to ask, I've tried doing the homework and tweaking what genAI gave me, but it didn't work out. — Arno
– Arno, Commented Oct 26 at 15:05
I'm not surprised that results of language model generated code aren't great if there's parameters to adjust to your needs; you will need to occasionally learn some background, then things become actually easy :) Writing the code is not the problem (there's thousands of software PLLs around that weren't written by coked-up random generators, but by engineers with a purpose and understanding; maybe look for these instead of getting a questionable one written for you by AI?), understanding the parameters is. And that's the whole crux here: I can't tell you what you need. You need to do that. — Marcus Müller
– Marcus Müller, Commented Oct 26 at 15:35
Hi again. I got lucky today and genAI got me a PLL that worked, though the gains were wrong. I threw different signals at the PLL and adjusted the gains through trial and error, and I have a PLL that guesses 5k initially and converges to any frequency within audible range rather quickly. I tried feeding it chirps from 1 second to 1 minute, and squares and sawtooths. It is oscillatory but averaging the guessed frequency over 0.1 second gives me the correct value. I'm not sure I understand how it all works, but thank you for introducing me to PLLs. — Arno
– Arno, Commented Nov 4 at 16:49
hey, happy for you, but fair warning, I think: I'm not gonna invest human time in code that no human spent time on designing. That's not a way to get forward. — Marcus Müller
– Marcus Müller, Commented Nov 4 at 16:50
no, you really don't need a semester. There's enough example code out there that you could just read and understand; you letting a language model (there's nothing artificially intelligent about that, that's just marketing) write one for you gives you code you don't understand and depend on other people understanding for you if it doesn't work; that's a time sink for other people, and for you too. — Marcus Müller
– Marcus Müller, Commented Nov 4 at 17:01

Hilmar · Accepted Answer · 2025-10-25 14:11:03Z

2

I heard THD is the standard metric to assess audio quality in an audio system,

It is one of many and whether it's suitable or not depends a lot on what exactly trying to do.

In the context of an Android phone, you could look at

Microphone placement, and quality
Microphone preprocessing and "enhancement" algorithms (beamformer, voice activity detector, speech enhancers, etc)
OS related processing (mixers, 3rd party plug ins)
Carrier imposed processing (the exact same phone can behave differently depending on whether it's on Verizon or T-Mobile).
Non-linear protection algorithms (voltage, current, excursion, short term thermal, term thermal)
Vibration, rub & buzz
Quality of speaker or playback systems (build-in speaker tend to be VERY non-linear)
Amplifier distortion
Probably another dozen or so

And that's just for an Android phone. "Audio System" is a very broad term and so is "Audio Quality". I think it would help if you would narrow things down to a specific use case with a specific set of requirements and then we may able to point you towards a suitable algorithm(s)

answered Oct 25 at 14:11

Hilmar

50.4k1 gold badge33 silver badges69 bronze badges

$\begingroup$ Thank you for opening up the perspectives @Hilmar. I replied to comments by Mark on my main question, you could read up what I wrote there, as I provide more context. I am not a saavy system designer, so I wouldn't know where/how to begin looking into all these levels of detail, though eventually I should jump in the deep end. However I am hoping that checking integrity of audio in "real time" (show a user a UI, so about 300ms reaction time) in the native C++ layer of Android would help me catch errors and guess where to look where it came from. $\endgroup$

Arno
– Arno

2025-10-26 12:40:53 +00:00
Commented Oct 26 at 12:40
$\begingroup$ All the points you mentioned would have effects that would accumulate and be detectable in the native Android C++ layer. Certainly, I could go an look at each and everyone of them (again, no experience doing that), but I was hoping an "all purpose" audio quality detector in C++ would be the canary I could use, to detect problems. I'm looking for the best thing I could implement at that level of the overall system. $\endgroup$

Arno
– Arno

2025-10-26 12:45:13 +00:00
Commented Oct 26 at 12:45
$\begingroup$ If you want to detect how "clean" audio is, you need define what exactly you mean by that. For example I had a project where we find that Android phones in general were not "clean" enough for our application but iOS phones were. But that statement only makes sense in the context of our specific set of requirements. Your best shot here may be define a set of cleanliness criteria (that matter in your specific context) with thresholds and build specific tests for each one. Clipping is easy to detect, drop outs are more tricky (for an unknown signal). $\endgroup$

Hilmar
– Hilmar

2025-10-26 13:02:16 +00:00
Commented Oct 26 at 13:02

Add a comment |

Mark · Accepted Answer · 2025-10-26 22:17:59Z

If your application ultimately needs to detect clipping in arbitrary speech then using a waveform generator, while beneficial for Android device characterization purposes, may not provide a good solution in your final application. I say this because the utility of a high fidelity generator where you can control the frequency and amplitude of the tone is that it allows for a very simple clip detector: Compare the ratio of energy at a known harmonic frequency to that of the fundamental. I could suggest a specific implementation if that would help.

However with speech, this approach won't work and you'd be better off with a completely different algorithm. For example, forming a histogram of speech samples and looking for peaks at the extremes (which is characteristic of clipping) may be a better approach. Refer to this post for details.

Edit: Simple clip detector using high-fidelity tone generator

A simple clip detector can be constructed by measuring the ratio of harmonic energy to fundamental. If clipping is symmetric, meaning both positive and negative extremes are clipped, then only the odd harmonics will be generated in a clipped sine wave. If clipping is asymmetric then both even and odd harmonics are generated. Because odd harmonics are generated either way, our simple clip detector could compare the energy in the 3rd harmonic to that of the fundamental.

Choosing 1 kHz as the test tone frequency results in 16 samples per period at Fs=16 kHz. Using frame size of 32 ms means there are 512 samples/frame. Using an FFT length of 512 at Fs=16 kHz means the bin spacing is 31.25 Hz, putting the fundamental at bin 32 and the 3rd harmonic at bin 96 (with DC at bin 0).

The clip detector begins by calculating a length 512 FFT for a 32 ms frame. It then calculates the ratio of bin 96 energy to bin 32 and compares the result to an empirically derived threshold. If the ratio exceeds the threshold then clipping is detected, else no clipping.

For the case where the largest 10% of samples are clipped and snr=70 dB, a metric of -29 dB indicates the 3rd harmonic is 29 dB below the fundamental. When there is no clipping the metric is -92 dB showing the energy at the 3rd harmonic is at the noise floor. In this example perhaps a metric threshold around -40 dB is appropriate. Run more test cases varying the clipping level to zero in on a better metric threshold.

Plots overlaying each block |fft|^2 vs frequency.

MATLAB code:

% System parameters
Fs = 16000;   % sample frequency (kHz)
ftone = 1000; % test tone frequency (Hz)
B = 512;      % block size (samples)
snrdb = 70;   % noise level (dB below signal)
CLIP = 0.90;  % clip level (clip above this level)

% Clean test signal
t = 0:1/Fs:1;  % time (seconds)
xclean = sin(2*pi*ftone*t + 0.2); % test waveform (peak amplitude = 1)
Nb = floor(numel(xclean)/B); % samples/block
xclean = xclean(1:Nb*B); % truncate to integer number of blocks

% Noise
noise = 10^(-snrdb/20) * std(xclean) * randn(size(xclean));

% Clipped signal
xclip = xclean;
xclip(xclean>CLIP) = CLIP; % any samples >CLIP are limited to CLIP
xclip(xclean<-CLIP) = -CLIP; % any samples <-CLIP are limited to -CLIP

% Clipped signal + noise
x2 = xclip + noise; % clipped
xc = reshape(x2(:),B,Nb); % Nb blocks of B samples each

% Analyze clipped
xc  = xc - mean(xc); % remove DC
X = fft(xc,B); % length B=512 fft (each block)
X = X(1:256,:); % use one-sided FFT (real-valued input)
X = X ./ max(abs(X)); % normlize for plot

% Kaiser window
w = kaiser(B,38); % beta=38
wx = xc .* repmat(w(:),1,size(xc,2));
WX = fft(wx); % length B fft (each windowed block)
WX = WX(1:256,:); % use one-sided FFT (real-valued input)
WX = WX ./ max(abs(WX)); % normlize for plot

% Unclipped signal + noise
x3 = xclean + noise; % unclipped
xnc = reshape(x3(:),B,Nb); % Nb blocks of B samples each

% Analyze unclipped
xnc  = xnc - mean(xnc); % remove DC
XNC = fft(xnc,B); % length B=512 fft (each block)
XNC = XNC(1:256,:); % use one-sided FFT (real-valued input)
XNC = XNC ./ max(abs(XNC)); % normlize for plot

% Kaiser window
w = kaiser(B,38); % beta=38
wxnc = xnc .* repmat(w(:),1,size(xnc,2));
WXNC = fft(wxnc); % length B fft (each windowed block)
WXNC = WXNC(1:256,:); % use one-sided FFT (real-valued input)
WXNC = WXNC ./ max(abs(WXNC)); % normlize for plot

% Plot results
figure(1);
NT = size(xc,1) / Fs;
f = 0:1/NT:Fs/2-1/NT;

% Clipped
subplot(221);
plot(f,20*log10(abs(X)));grid on;
axis([0 8000 -120 0]);
xlabel('Frequeny (Hz)');
ylabel('|FFT|^2 (dB)');
title('Clip top 10% (Unwindowed)');
Fundamental = abs(X(1+32))^2;
ThirdHarmonic = abs(X(1+3*32))^2;
Metric_dB = 10*log10(ThirdHarmonic / Fundamental);
fprintf(1,'Metric_dB=%.1f dB\n', Metric_dB);

subplot(222);
plot(f,20*log10(abs(WX)));grid on;
axis([0 8000 -120 0]);
xlabel('Frequeny (Hz)');
ylabel('|FFT|^2 (dB)');
title('Clip top 10% (Kaiser Windowed)');
WFundamental = abs(WX(1+32))^2;
WThirdHarmonic = abs(WX(1+3*32))^2;
WMetric_dB = 10*log10(WThirdHarmonic / WFundamental);
fprintf(1,'Metric_dB(windowed)=%.1f dB\n', WMetric_dB);

% Unclipped
subplot(223);
plot(f,20*log10(abs(XNC)));grid on;
axis([0 8000 -120 0]);
xlabel('Frequeny (Hz)');
ylabel('|FFT|^2 (dB)');
title('No Clipping (Unwindowed)');
Fundamental = abs(XNC(1+32))^2;
ThirdHarmonic = abs(XNC(1+3*32))^2;
Metric_dB = 10*log10(ThirdHarmonic / Fundamental);
fprintf(1,'Metric_dB=%.1f dB\n', Metric_dB);

subplot(224);
plot(f,20*log10(abs(WXNC)));grid on;
axis([0 8000 -120 0]);
xlabel('Frequeny (Hz)');
ylabel('|FFT|^2 (dB)');
title('No Clipping (Kaiser Windowed)');
WFundamental = abs(WXNC(1+32))^2;
WThirdHarmonic = abs(WXNC(1+3*32))^2;
WMetric_dB = 10*log10(WThirdHarmonic / WFundamental);
fprintf(1,'Metric_dB(windowed)=%.1f dB\n', WMetric_dB);

Thank you for the pointers at speech specific audio quality applications. I think I will get there eventually and the references you've pointed at will be very valuable. I would be happy to take the suggestion on the implementation of the high fidelity generator test, because I have not been able to make one work based on Marcus Müller's description yet, although, due to lack of time, I have not tried extremely hard, though I did spend an hour on it. I'll be sure to accept an answer and mark this question as "solved" as soon as I got a good grasp of the tips kindly provided to me. — Arno
– Arno, Commented Oct 26 at 20:20
Thank you very much for the walkthrough Mark ! I think the FFT THD computation I already had is close to what you have provided, however you have given me more by showing me what criteria to apply, what harmonics to look for, what kind of threshold to set, and also a hint to the speech clipping measurements from the other thread. I will try to take this with me and see if I can make Marcus' PLL approach work as well. I will need some time to make attempts though. — Arno
– Arno, Commented Oct 27 at 8:50
Hi @Mark. Thank you again for your reply. I wish I could validate two answers, because the insight you gave is very valuable, but I guess I'll validate Marcus' answer, as it offered a way to compute THD without spectral leakage. Your way of measuring clipping and giving thresholds for it, will be very useful though, so thank you again. — Arno
– Arno, Commented Nov 4 at 16:58

Stack Exchange Network

What is the best way to assess audio quality in an embedded or Android system?

3 Answers 3

A spectral estimator to find the fundamental

An estimator for the powers of fundamental and harmonics.

Things where the FFT might actually start to appear again

Edit: Simple clip detector using high-fidelity tone generator

Your Answer

Linked

Hot Network Questions

What is the best way to assess audio quality in an embedded or Android system?

3 Answers 3

A spectral estimator to find the fundamental

An estimator for the powers of fundamental and harmonics.

Things where the FFT might actually start to appear again

Edit: Simple clip detector using high-fidelity tone generator

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions