I have a simple convolutional model that classifies 8 spoken words from ~1-second audio clips at a sample_rate of 16000. Training/test accuracy is good enough that I want to start testing it on live microphone audio recorded directly on the device. How would I write an inference function that takes variable-length audio input (from the microphone) instead of the fixed 1-second clips the model was trained on?
I assume I have to use sliding windows with a width equal to my sample rate, or something similar, but I'm not confident in this approach.
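To make the question concrete, here is a rough sketch of the sliding-window idea I have in mind. The `predict` callable is a placeholder for my model (assumed to return per-class probabilities of shape `(n_windows, 8)`), and the 50% hop and probability averaging are just guesses on my part:

```python
import numpy as np

SAMPLE_RATE = 16000          # training sample rate
WINDOW_SIZE = SAMPLE_RATE    # 1-second windows, matching the training clips
HOP_SIZE = SAMPLE_RATE // 2  # 50% overlap between windows (arbitrary choice)

def sliding_windows(audio: np.ndarray) -> np.ndarray:
    """Split a 1-D audio array into overlapping 1-second windows,
    zero-padding short audio so it still forms at least one window."""
    if len(audio) < WINDOW_SIZE:
        audio = np.pad(audio, (0, WINDOW_SIZE - len(audio)))
    starts = range(0, len(audio) - WINDOW_SIZE + 1, HOP_SIZE)
    return np.stack([audio[s:s + WINDOW_SIZE] for s in starts])

def classify_clip(audio: np.ndarray, predict) -> int:
    """Run the model on every window, average the per-class
    probabilities across windows, and return the argmax class."""
    windows = sliding_windows(audio)      # shape: (n_windows, WINDOW_SIZE)
    probs = predict(windows)              # placeholder model call, (n_windows, 8)
    return int(np.mean(probs, axis=0).argmax())
```

Is averaging window probabilities like this reasonable, or should each window be treated as a separate detection event?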
Feedback and advice are greatly appreciated, thank you.