I currently have a few thousand audio clips that I need to classify with machine learning.
After some digging I found that applying a short-time Fourier transform (STFT) to the audio turns it into a 2-dimensional image, so I can run image classification algorithms on these images instead of on the audio files themselves.
To this end I found a Python package that computes the STFT, so all I need is to plot the result to get the images. For plotting I found this GitHub repo very useful.
My code ended up like this:
import stft
import scipy
import scipy.io.wavfile as wav
import matplotlib.pylab as pylab

def save_stft_image(source_filename, destination_filename):
    fs, audio = wav.read(source_filename)
    X = stft.spectrogram(audio)
    print X.shape
    fig = pylab.figure()
    ax = pylab.Axes(fig, [0,0,1,1])
    ax.set_axis_off()
    fig.add_axes(ax)
    pylab.imshow(scipy.absolute(X[:][:][0].T), origin='lower', aspect='auto', interpolation='nearest')
    pylab.savefig(destination_filename)

save_stft_image("Example.wav","Example.png")
The code works, but the line that prints X.shape gives (513L, 943L, 2L), so the result is 3-dimensional. I only get an image when I index it as X[:][:][0] or X[:][:][1].
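To see what that indexing actually selects, here is a minimal NumPy sketch on a dummy array of the same shape. It assumes the axes are (frequency bins, frames, channels), which I have not confirmed against the stft package, and the array contents are placeholders.

```python
import numpy as np

# Dummy array with the same shape as the reported spectrogram output.
# Assumed axis order: (frequency bins, frames, channels).
X = np.arange(513 * 943 * 2).reshape(513, 943, 2)

# X[:] is a view of the whole array, so the chained [:] does nothing and
# X[:][:][0] is simply X[0]: the first frequency bin, with shape (943, 2).
first_bin = X[:][:][0]

# Selecting along the last axis needs one multi-axis index instead.
first_channel = X[:, :, 0]

print(first_bin.shape)      # (943, 2)
print(first_channel.shape)  # (513, 943)
```

If the axis assumption holds, X[:][:][0] plots a single frequency bin across both channels rather than the spectrogram of one channel.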
I keep reading about the "redundancy" of the STFT: that you can discard half of the output because you do not need it. Is that third dimension the redundancy, or am I doing something very wrong here? If so, how do I plot it properly?
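As far as I understand it, the redundancy refers to the mirrored half of the spectrum of a real-valued signal, not to an extra array dimension. A small sketch with plain NumPy on one dummy frame; the frame length of 1024 is a guess on my part, picked only because 1024 // 2 + 1 matches the 513 above.

```python
import numpy as np

n = 1024  # hypothetical frame length, chosen so that n // 2 + 1 == 513
rng = np.random.default_rng(0)
frame = rng.standard_normal(n)  # one real-valued frame of dummy audio

full = np.fft.fft(frame)   # all n complex bins
half = np.fft.rfft(frame)  # only the n // 2 + 1 non-redundant bins

# For real input, bin n - k is the complex conjugate of bin k,
# so the upper half of the full spectrum carries no new information.
is_mirrored = np.allclose(full[1:], np.conj(full[1:][::-1]))

# The one-sided transform is exactly the first n // 2 + 1 bins of the full one.
same_lower_half = np.allclose(full[:n // 2 + 1], half)

print(len(full), len(half))  # 1024 513
```

So a first axis of length 513 would already be the half-spectrum, and the third dimension must be something else.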
Thank you.
Edit: The new code and its output are:
import stft
import os
import scipy
import scipy.io.wavfile as wav
import matplotlib.pylab as pylab

def save_stft_image(source_filename, destination_filename):
    fs, audio = wav.read(source_filename)
    audio = scipy.mean(audio, axis = 1)
    X = stft.spectrogram(audio)
    print X.shape
    fig = pylab.figure()
    ax = pylab.Axes(fig, [0,0,1,1])
    ax.set_axis_off()
    fig.add_axes(ax)
    pylab.imshow(scipy.absolute(X.T), origin='lower', aspect='auto', interpolation='nearest')
    pylab.savefig(destination_filename)

save_stft_image("Example.wav","Example.png")
On the left I get an almost invisible column of colors. The sounds I am working with are respiratory sounds, so they contain mostly very low frequencies. Maybe that is why the visualization is just a very thin column of colors.
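One thing I could try is keeping only the low-frequency bins and switching to a decibel scale before plotting. This is a rough sketch under my own assumptions: X has shape (frequency bins, frames), the bins are evenly spaced from 0 Hz to fs / 2, and the helper name low_band_db, the 44100 Hz sample rate and the 2000 Hz cutoff are made up for illustration.

```python
import numpy as np

def low_band_db(X, fs, fmax, floor_db=-80.0):
    """Log-magnitude (dB relative to the peak) of the bins at or below fmax Hz.

    X is assumed to be a one-sided STFT with shape (n_bins, n_frames).
    """
    n_bins = X.shape[0]
    # Assumed bin layout: evenly spaced from 0 Hz up to the Nyquist frequency.
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    mag = np.abs(X[freqs <= fmax])
    peak = mag.max()
    ref = peak if peak > 0 else 1.0
    # Clip at floor_db so silent bins do not produce -inf.
    return 20.0 * np.log10(np.maximum(mag / ref, 10.0 ** (floor_db / 20.0)))

# Dummy complex spectrogram with the same shape as my real one.
rng = np.random.default_rng(0)
X = rng.standard_normal((513, 943)) + 1j * rng.standard_normal((513, 943))

img = low_band_db(X, fs=44100, fmax=2000.0)
print(img.shape)  # (47, 943)
```

The result could then go to pylab.imshow with origin='lower' and aspect='auto', without the transpose, so that frequency runs along the vertical axis and time along the horizontal one.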

