
I currently have a few thousand audio clips that I need to classify with machine learning.

After some digging I found that if you apply a short-time Fourier transform (STFT) to the audio, it turns into a 2-dimensional image, so I can use various image classification algorithms on these images instead of on the audio files themselves.
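
As a minimal illustration of what I mean (using scipy.signal.stft here rather than the stft package I use below, with a made-up sample rate and test tone):

import numpy as np
from scipy import signal

fs = 8000                              # assumed sample rate for the toy example
t = np.arange(fs) / float(fs)          # one second of samples
x = np.sin(2 * np.pi * 440 * t)        # a 440 Hz test tone

f, times, Zxx = signal.stft(x, fs=fs, nperseg=256)
magnitude = np.abs(Zxx)                # 2-D array: frequency bins x time frames
print(magnitude.shape)                 # roughly (129, 64) - can be treated as an image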

To this end I found a Python package that computes the STFT; all I need is to plot the result so I can get the images. For the plotting I found this GitHub repo very useful.

Finally, my code ended up like this:

import stft
import scipy
import scipy.io.wavfile as wav
import matplotlib.pylab as pylab

def save_stft_image(source_filename, destination_filename):
    # Read the wav file and compute its STFT
    fs, audio = wav.read(source_filename)
    X = stft.spectrogram(audio)

    print X.shape

    # Borderless figure so the saved image contains only the spectrogram
    fig = pylab.figure()
    ax = pylab.Axes(fig, [0, 0, 1, 1])
    ax.set_axis_off()
    fig.add_axes(ax)
    pylab.imshow(scipy.absolute(X[:][:][0].T), origin='lower', aspect='auto', interpolation='nearest')
    pylab.savefig(destination_filename)

save_stft_image("Example.wav", "Example.png")

And the output is: [output spectrogram image]

The code works; however, when the print X.shape line executes I get (513L, 943L, 2L), so the result is 3-dimensional, and I only get an image when I write X[:][:][0] or X[:][:][1].

I keep reading about the "redundancy" the STFT has, i.e. that you can discard half of it because you do not need it. Is that 3rd dimension the redundancy, or am I doing something very wrong here? If so, how do I plot it properly?

Thank you.

Edit: The new code and its output are:

import stft
import scipy
import scipy.io.wavfile as wav
import matplotlib.pylab as pylab

def save_stft_image(source_filename, destination_filename):
    # Read the wav file, mix it down to mono, then compute its STFT
    fs, audio = wav.read(source_filename)
    audio = scipy.mean(audio, axis=1)
    X = stft.spectrogram(audio)

    print X.shape

    # Borderless figure so the saved image contains only the spectrogram
    fig = pylab.figure()
    ax = pylab.Axes(fig, [0, 0, 1, 1])
    ax.set_axis_off()
    fig.add_axes(ax)
    pylab.imshow(scipy.absolute(X.T), origin='lower', aspect='auto', interpolation='nearest')
    pylab.savefig(destination_filename)

save_stft_image("Example.wav", "Example.png")

[output spectrogram image]

On the left I get an almost invisible column of colors. The sounds I am working with are respiratory sounds, so they have very low frequencies; maybe that is why the visualization is only a very thin column of colors.

1 Answer


You probably have a stereo audio file, so the third dimension holds the two channels. (Strictly speaking, X[:][:][0] is just X[0]; a single channel would be X[:, :, 0] or X[:, :, 1].)
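
To make the indexing concrete, here is a small sketch with a dummy array of the shape from your question:

import numpy as np

X = np.zeros((513, 943, 2))    # dummy array with the shape from the question

print(X[:][:][0].shape)        # (943, 2)   -> this is just X[0], a single frequency bin
print(X[:, :, 0].shape)        # (513, 943) -> all bins and frames of the first channel
print(X[..., 1].shape)         # (513, 943) -> the second channel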

You can convert multichannel audio to mono with scipy.mean(audio, axis=1).
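
A minimal sketch of that step (the ndim check is an extra guard I would add so already-mono files pass through unchanged; scipy.mean is the same numpy.mean alias used in your code):

import scipy
import scipy.io.wavfile as wav
import stft

fs, audio = wav.read("Example.wav")

if audio.ndim > 1:                       # multichannel: shape is (samples, channels)
    audio = scipy.mean(audio, axis=1)    # average the channels down to mono

X = stft.spectrogram(audio)              # now 2-D: frequency bins x time frames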
