How to apply CNN to live/variable-length audio?

I have a simple convolutional model capable of classifying 8 spoken words from ~1 second long audio clips at a sample_rate of 16000. Training/test accuracy is good enough that I want to start testing it on live microphone audio recorded directly on the device. How would I write an inference function that takes variable-length audio input (from the microphone) instead of the 1-second clips the model was trained on?

I assume I have to use sliding windows with a width equal to my sample rate, or something similar; however, I am not too confident in this approach.

Feedback and advice are greatly appreciated, thank you.

I think this approach may work for you:

segment_size = 16000  # one second of audio at a 16 kHz sample rate
stride_size = 4000    # example hop between windows; tune for latency vs. compute

all_audio = []  # rolling buffer of raw samples
while True:
    mic_audio = read_from_mic()  # placeholder: returns the latest chunk of samples
    all_audio.extend(mic_audio)
    # run the model on every full window, sliding forward by stride_size
    while len(all_audio) >= segment_size:
        segment = all_audio[:segment_size]
        all_audio = all_audio[stride_size:]
        result_of_segment = my_model(segment)
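
The segment above is still a plain Python list, so it has to be turned into whatever input format your network expects before you call it. Below is a minimal sketch, assuming the model is a PyTorch module expecting float input of shape (batch, channel, time) = (1, 1, 16000); classify_segment is a hypothetical helper name, and the shape should be adjusted to match your training pipeline.

import numpy as np
import torch

def classify_segment(model, segment, sample_rate=16000):
    # segment: 1-D sequence of raw samples, exactly one second long
    x = np.asarray(segment, dtype=np.float32)
    x = torch.from_numpy(x).reshape(1, 1, sample_rate)  # (batch, channel, time)
    with torch.no_grad():                                # inference only, no gradients
        logits = model(x)
    return logits.argmax(dim=-1).item()                  # index of the predicted word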

I’ve also run some experiments with 1D convolutions on variable-length audio data and ran into some problems that can occur:

  1. the network might learn the signal length indirectly and turn it into a feature, which is probably not what you want
  2. the approach forces you to use a batch size of 1, which means you cannot use batch normalization. Other normalization approaches were even worse in my experiments; layer norm, for example, led to divergence.

What I did to solve the issues during training:

  1. apply a padding technique such as zero padding or repetition, so that a whole batch fits into one tensor (see the sketch after this list)
  2. use batch normalization
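
To make the first point concrete, here is a minimal sketch of a padding step, assuming PyTorch and a list of 1-D float tensors of different lengths; pad_batch and target_len are hypothetical names, and zero padding is just one of the options mentioned above (repetition would work the same way).

import torch

def pad_batch(clips, target_len):
    # clips: list of 1-D float tensors with different lengths
    batch = torch.zeros(len(clips), target_len)  # zero padding by default
    for i, clip in enumerate(clips):
        n = min(clip.shape[0], target_len)
        batch[i, :n] = clip[:n]                  # copy (and truncate if too long)
    return batch.unsqueeze(1)                    # shape (batch, 1, target_len)

Because all clips end up in one tensor, you can keep a batch size larger than 1 and therefore keep batch normalization, which covers the second point.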

Reading your post again, it seems you are only asking about running inference with an already trained network. In that case you can simply slice your data to match the expected input length.
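
For example, here is a minimal sketch (assuming NumPy, the 16000-sample window from the question, and a hypothetical helper name fit_to_window) that slices a longer recording down to one window and zero-pads a shorter one:

import numpy as np

def fit_to_window(audio, segment_size=16000):
    # audio: 1-D array of raw samples of arbitrary length
    audio = np.asarray(audio, dtype=np.float32)
    if len(audio) >= segment_size:
        return audio[:segment_size]                        # slice down to one window
    return np.pad(audio, (0, segment_size - len(audio)))   # zero-pad the remainder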