
For context: I'm trying to create a simple "level monitor" animation of audio data streaming from a microphone. I'm running this code on an iOS device and leaning heavily on the Accelerate framework for data processing.

A lot of what I have so far is heavily influenced by this example project from Apple: https://developer.apple.com/documentation/accelerate/visualizing_sound_as_an_audio_spectrogram

Here are the current steps I'm taking:

  1. Start receiving (Int16) samples from the microphone using AVFoundation.
  2. Store samples until I have at least 1024, then send the first 1024 samples to my processing algorithm.
  3. Convert samples to denormalized Float (single-precision floating point).
  4. Apply a Hann window to the samples to reduce spectral leakage, since the number of samples is fairly low for performance reasons.
  5. Run a Forward DCT-II transformation of the time-domain samples into frequency-domain samples.
  6. Absolute value on all samples.
  7. "Bin" the samples to match the number of bars I have to animate... for each 1024/n samples, find the maximum value in each range.
  8. Normalize each of the bins into the 0...1 range by dividing each by the highest magnitude sample that has been encountered, globally.
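The steps above can be sketched in plain Swift. This is a hypothetical illustration, not the question's actual code: the naive O(N²) DCT-II here exists only to show the math, and a real iOS app would use Accelerate (e.g. `vDSP.DCT`) for all of these loops. Step 8's division by a running global maximum is omitted.

```swift
import Foundation

// Naive DCT-II, for illustration only; Accelerate's vDSP.DCT replaces this in practice.
func dctII(_ x: [Float]) -> [Float] {
    let n = x.count
    return (0..<n).map { k -> Float in
        var sum: Float = 0
        for i in 0..<n {
            sum += x[i] * Float(cos(Double.pi * Double(k) * (Double(i) + 0.5) / Double(n)))
        }
        return sum
    }
}

// Hypothetical pipeline mirroring steps 3-7.
func levelBins(samples: [Int16], barCount: Int) -> [Float] {
    let n = samples.count
    // Step 3: Int16 -> Float, still in the raw -32768...32767 range.
    var signal = samples.map { Float($0) }
    // Step 4: Hann window, tapering the frame edges to reduce leakage.
    for i in 0..<n {
        signal[i] *= Float(0.5 * (1 - cos(2 * Double.pi * Double(i) / Double(n - 1))))
    }
    // Step 5: forward DCT-II into the frequency domain.
    let spectrum = dctII(signal)
    // Step 6: magnitudes.
    let magnitudes = spectrum.map { abs($0) }
    // Step 7: maximum within each of barCount equal-width ranges.
    let binSize = n / barCount
    return (0..<barCount).map { b in
        magnitudes[(b * binSize)..<((b + 1) * binSize)].max() ?? 0
    }
}
```

Feeding this a pure cosine that lands on a low DCT bin produces a large value in the first bar and near-zero in the rest, which matches the behavior described below.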

Honestly, after step 5, I just have no intuitive understanding of what is going on with the frequency-domain values. I get that a higher value means the frequency represented by that element is more prevalent in the time-domain data... but I don't know what a value of, say, 12 vs 6492 means.

Anyway, the end result is that the lowest bin (0...255) has a power that is basically just the overall amplitude, while the higher 3 bins never rise above 0.001. I feel like I'm on the right track, but that my ignorance of what the DCT output means is preventing me from figuring out what is going wrong here. I could also use FFT, if that would produce a better result, but I'm given to understand that FFT and DCT produce analogous results and Apple recommends DCT for performance.

  • Is the goal a single overall level? Or multiple frequency bands? Commented Apr 30, 2024 at 22:03
  • In this case, 4 bands, but I'd like a solution that could scale to an arbitrary (but probably < 20) number of bands. Commented May 1, 2024 at 16:09

2 Answers


The DFT/DCT is linear in its input, so when the inputs are amplitudes (as is the case for a standard audio file or microphone input), the outputs are amplitudes too.

It seems this will be used for visualization. In that case, I recommend converting the amplitudes to decibels. It makes the range of values much more compact, which is desirable when showing on finite screen real estate, and it is also quite conventional. For an amplitude, that is 20*log10(amp/ref), where ref can be just 1.0 if you are going to normalize afterward anyway (your step 8). Note that normalization in the decibel domain is an additive shift, not a division.
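A minimal sketch of that conversion, with hypothetical helper names (the real app could use Accelerate's amplitude-to-decibel conversion instead):

```swift
import Foundation

// Convert linear magnitudes to decibels: 20 * log10(amp / ref).
// A tiny floor guards against log10(0).
func decibels(_ magnitudes: [Float], reference: Float = 1.0) -> [Float] {
    magnitudes.map { 20 * Float(log10(Double(max($0, 1e-12)) / Double(reference))) }
}

// Map a dB value onto 0...1 for drawing, clamping at a chosen floor (e.g. -60 dB).
func barHeight(dB: Float, floor floorDB: Float = -60) -> Float {
    min(max((dB - floorDB) / -floorDB, 0), 1)
}
```

Normalizing in the decibel domain then means subtracting the running maximum dB value, so the loudest bin sits at 0 dB and everything else falls below it.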

The frequency bins of the DCT are at k/(2N) * fs, where k is the bin index, N is the transform length, and fs is the sample rate.
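That relation and its inverse as small helpers (hypothetical names, just encoding the formula above):

```swift
import Foundation

// Center frequency of DCT bin k: f = k / (2N) * fs.
func binFrequency(k: Int, length n: Int, sampleRate fs: Double) -> Double {
    Double(k) / (2.0 * Double(n)) * fs
}

// Inverse: the bin index nearest a given frequency, k = round(f * 2N / fs).
func binIndex(frequency f: Double, length n: Int, sampleRate fs: Double) -> Int {
    Int((f * 2.0 * Double(n) / fs).rounded())
}
```

For example, at fs = 44100 and N = 2048, a 440 Hz tone lands around bin 41, consistent with the comment discussion below.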


4 Comments

Do you have any idea why my results seem extremely biased toward the lower quarter of the frequency-domain samples? In theory, my Nyquist frequency would be 512 Hz (half of 1024 samples), so each of the frequency-domain indices would represent 1/2 Hz of difference, right? To me, that means it's finding all of the signal power below 128 Hz, which is deeply suspicious for human voice.
@JoshuaSullivan if this is audio, your samplerate is likely not 1024 Hz? The DFT/DCT length influences the resolution of the frequency, but not the range!
So given that my data is 44.1kHz 16-bit audio (-32k to 32k) values, does that mean that the bin size (forget the grouping for now) is 64,000 / 2048 * 44100?
The amplitude does not factor in. The frequency resolution of the DCT coefficients would be 44100/(2048*2) = 10.76 Hz per bin. So 440 Hz would be around bin number 40-41.

So the conversation with Juha P on the DCT stack exchange made me realize that I had been badly misinterpreting the results of the DCT. Once I used a sweep tone, it became clear that my data was fine, but the range was crazy-inappropriate for my intended use.

I found out that iPhone microphones record at 48kHz by default, making my Nyquist frequency 24kHz. Given that human speech is generally in the 100-300Hz range, the problem was that all of the usable data was in the bottom few bins of the 1024 result values.

I changed the "grouping" step (#7) to only look at bins 1-41 (out of 1024), using 10 bins per "output" value. Immediately, I could see the effect I wanted, with the bars responding to changes in pitch in my voice.
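A sketch of that revised grouping step, under the assumptions described above (48 kHz audio, 1024 DCT magnitudes, 4 bars); the function name and signature are hypothetical:

```swift
import Foundation

// Revised step 7: skip bin 0 (which mostly tracks overall level) and take the
// maximum of each 10-bin group, so 4 bars cover the low-frequency bins 1...40.
func speechBars(magnitudes: [Float], groupSize: Int = 10, barCount: Int = 4) -> [Float] {
    (0..<barCount).map { b -> Float in
        let start = 1 + b * groupSize
        let end = min(start + groupSize, magnitudes.count)
        return magnitudes[start..<end].max() ?? 0
    }
}
```

Because bin 0 is excluded, a large DC-like value no longer swamps the first bar, and a spike in any of bins 1...40 shows up in exactly one bar.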

Big thanks to Juha P and Jon Nordby for helping me figure this out!

1 Comment

For speech, you could consider recording at, or resampling to, 8000 Hz. You might also consider applying a frequency weighting that emulates human hearing sensitivity, such as A-weighting.
