
For context: I'm trying to create a simple "level monitor" animation of audio data streaming from a microphone. I'm running this code on an iOS device and leaning heavily on the Accelerate framework for data processing.

A lot of what I have so far is heavily influenced by this example project from Apple: https://developer.apple.com/documentation/accelerate/visualizing_sound_as_an_audio_spectrogram

Here are the current steps I'm taking:

  1. Start receiving (Int16) samples from the microphone using AVFoundation.
  2. Store samples until I have at least 1024, then send the first 1024 samples to my processing algorithm.
  3. Convert samples to denormalized Float (single-precision floating point).
  4. Apply a Hann window to the samples to reduce spectral leakage, since the number of samples is fairly low for performance reasons.
  5. Run a Forward DCT-II transformation of the time-domain samples into frequency-domain samples.
  6. Absolute value on all samples.
  7. "Bin" the samples to match the number of bars I have to animate... for each 1024/n samples, find the maximum value in each range.
  8. Normalize each of the bins into the 0...1 range by dividing each by the highest magnitude sample that has been encountered, globally.
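The steps above can be sketched in plain Swift. This is a hypothetical illustration, not the question's actual code: the naive O(N²) DCT-II here exists only to show the math, and a real iOS app would use Accelerate (e.g. `vDSP.DCT`) for all of these loops. Step 8's division by a running global maximum is omitted.

```swift
import Foundation

// Naive DCT-II, for illustration only; Accelerate's vDSP.DCT replaces this in practice.
func dctII(_ x: [Float]) -> [Float] {
    let n = x.count
    return (0..<n).map { k -> Float in
        var sum: Float = 0
        for i in 0..<n {
            sum += x[i] * Float(cos(Double.pi * Double(k) * (Double(i) + 0.5) / Double(n)))
        }
        return sum
    }
}

// Hypothetical pipeline mirroring steps 3-7.
func levelBins(samples: [Int16], barCount: Int) -> [Float] {
    let n = samples.count
    // Step 3: Int16 -> Float, still in the raw -32768...32767 range.
    var signal = samples.map { Float($0) }
    // Step 4: Hann window, tapering the frame edges to reduce leakage.
    for i in 0..<n {
        signal[i] *= Float(0.5 * (1 - cos(2 * Double.pi * Double(i) / Double(n - 1))))
    }
    // Step 5: forward DCT-II into the frequency domain.
    let spectrum = dctII(signal)
    // Step 6: magnitudes.
    let magnitudes = spectrum.map { abs($0) }
    // Step 7: maximum within each of barCount equal-width ranges.
    let binSize = n / barCount
    return (0..<barCount).map { b in
        magnitudes[(b * binSize)..<((b + 1) * binSize)].max() ?? 0
    }
}
```

Feeding this a pure cosine that lands on a low DCT bin produces a large value in the first bar and near-zero in the rest, which matches the behavior described below.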

Honestly, after step 5, I just have no intuitive understanding of what is going on with the frequency-domain values. I get that a higher value means the frequency represented by that element is more prevalent in the time-domain data... but I don't know what a value of, say, 12 vs 6492 means.

Anyway, the end result is that the lowest bin (0...255) has a power that is basically just the overall amplitude, while the higher 3 bins never rise above 0.001. I feel like I'm on the right track, but that my ignorance of what the DCT output means is preventing me from figuring out what is going wrong here. I could also use FFT, if that would produce a better result, but I'm given to understand that FFT and DCT produce analogous results and Apple recommends DCT for performance.

  • Is the goal a single overall level? Or multiple frequency bands? Commented Apr 30, 2024 at 22:03
  • In this case, 4 bands, but I'd like a solution that could scale to an arbitrary (but probably < 20) number of bands. Commented May 1, 2024 at 16:09

2 Answers


The DFT/DCT is linear in its input, so when the inputs are amplitudes (as is the case for a standard audio file or microphone input), the outputs are amplitudes too.

It seems this will be used for visualization. In that case, I recommend converting the amplitudes to decibels. It makes the range of values much more compact, which is desirable when showing on finite screen real estate, and it is also quite conventional. For an amplitude, that is 20*log10(amp/ref), where ref can be just 1.0 if you are going to normalize afterward anyway (your step 8). Note that normalization in the decibel domain is an additive shift, not a division.
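A minimal sketch of that conversion, with hypothetical helper names (the real app could use Accelerate's amplitude-to-decibel conversion instead):

```swift
import Foundation

// Convert linear magnitudes to decibels: 20 * log10(amp / ref).
// A tiny floor guards against log10(0).
func decibels(_ magnitudes: [Float], reference: Float = 1.0) -> [Float] {
    magnitudes.map { 20 * Float(log10(Double(max($0, 1e-12)) / Double(reference))) }
}

// Map a dB value onto 0...1 for drawing, clamping at a chosen floor (e.g. -60 dB).
func barHeight(dB: Float, floor floorDB: Float = -60) -> Float {
    min(max((dB - floorDB) / -floorDB, 0), 1)
}
```

Normalizing in the decibel domain then means subtracting the running maximum dB value, so the loudest bin sits at 0 dB and everything else falls below it.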

The frequency bins of the DCT are at k/(2N) * fs, where k is the bin index, N is the transform length, and fs is the sample rate.
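That relation and its inverse as small helpers (hypothetical names, just encoding the formula above):

```swift
import Foundation

// Center frequency of DCT bin k: f = k / (2N) * fs.
func binFrequency(k: Int, length n: Int, sampleRate fs: Double) -> Double {
    Double(k) / (2.0 * Double(n)) * fs
}

// Inverse: the bin index nearest a given frequency, k = round(f * 2N / fs).
func binIndex(frequency f: Double, length n: Int, sampleRate fs: Double) -> Int {
    Int((f * 2.0 * Double(n) / fs).rounded())
}
```

For example, at fs = 44100 and N = 2048, a 440 Hz tone lands around bin 41, consistent with the comment discussion below.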


4 Comments

Do you have any idea why my results seem extremely biased toward the lower quarter of the frequency-domain samples? In theory, my Nyquist frequency would be 512 Hz (half of 1024 samples), so each of the frequency-domain indices would represent 1/2 Hz of difference, right? To me, that means it's finding all of the signal power below 128 Hz, which is deeply suspicious for human voice.
@JoshuaSullivan if this is audio, your samplerate is likely not 1024 Hz? The DFT/DCT length influences the resolution of the frequency, but not the range!
So given that my data is 44.1kHz 16-bit audio (-32k to 32k) values, does that mean that the bin size (forget the grouping for now) is 64,000 / 2048 * 44100?
The amplitude does not factor in. The frequency resolution of the DCT coefficients would be 44100/(2048*2) = 10.76 Hz per bin. So 440 Hz would be around bin number 40-41.

So the conversation with Juha P on the DCT stack exchange made me realize that I had been badly misinterpreting the results of the DCT. Once I used a sweep tone, it became clear that my data was fine, but the range was crazy-inappropriate for my intended use.

I found out that iPhone microphones record at 48kHz by default, making my Nyquist frequency 24kHz. Given that human speech is generally in the 100-300Hz range, the problem was that all of the usable data was in the bottom few bins of the 1024 result values.

I changed the "grouping" step (#7) to only look at bins 1-41 (out of 1024), using 10 bins per "output" value. Immediately, I could see the effect I wanted, with the bars responding to changes in pitch in my voice.
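A sketch of that revised grouping step, under the assumptions described above (48 kHz audio, 1024 DCT magnitudes, 4 bars); the function name and signature are hypothetical:

```swift
import Foundation

// Revised step 7: skip bin 0 (which mostly tracks overall level) and take the
// maximum of each 10-bin group, so 4 bars cover the low-frequency bins 1...40.
func speechBars(magnitudes: [Float], groupSize: Int = 10, barCount: Int = 4) -> [Float] {
    (0..<barCount).map { b -> Float in
        let start = 1 + b * groupSize
        let end = min(start + groupSize, magnitudes.count)
        return magnitudes[start..<end].max() ?? 0
    }
}
```

Because bin 0 is excluded, a large DC-like value no longer swamps the first bar, and a spike in any of bins 1...40 shows up in exactly one bar.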

Big thanks to Juha P and Jon Nordby for helping me figure this out!

1 Comment

For speech, you could consider recording at, or resampling to, 8000 Hz. You might also consider applying a frequency weighting that emulates human hearing sensitivity, such as A-weighting.
