Live data from pyaudio and librosa into resnet32

Dear people,
As of now I have written code for cough detection, which detects sound above a certain decibel level, records it for 5 seconds, and passes it into my resnet32 model to find out whether it's a cough or not! My doubt is whether it's possible to pass real-time audio data into a saved .pth or .pkl file. So far I have used examples which convert the sound into a mel spectrogram using librosa and pass the spectrogram data, clipped to 5 seconds, into my model to detect the coughing. My plan is to create a live mel spectrogram graph of the real-time sound captured by the mic and send it into my model without any latency.
Let me summarize:

  1. Is it possible to give only a single frame of a mel spectrogram to a resnet32 model?
    Or
  2. Is it possible to give a continuously streaming mel spectrogram, which will have multiple frames per second? I.e. it will be plotting the mel spectrogram of the real-time sound being captured with the mic. Is it possible to pass that streaming mel spectrogram into my resnet32 model?
  1. It should be possible, similar to passing a single image to a standard CNN. You would have to set the batch size to 1, but other than that, I don’t think it should be different from a batched approach (see the sketch below this list).

  2. It depends on the general workflow you are using. As long as you can pass a tensor in the right shape (batched or a single sample), your model should work from the perspective of PyTorch.
    The main question is, of course, which streaming framework you would like to use and how you are going to grab and process the data. Do you have anything in mind so far?
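
For point 1, here is a minimal sketch of the idea, assuming your resnet32 was trained on single-channel mel spectrogram “images” (the model class, checkpoint name, and spectrogram settings below are placeholders, not your actual values):

```python
import numpy as np
import librosa
import torch

# Hypothetical: your own model class and checkpoint path
model = MyResNet32()
model.load_state_dict(torch.load("cough_resnet32.pth"))
model.eval()

# Load a 5s clip and compute its mel spectrogram (use the same settings as in training)
waveform, sr = librosa.load("clip.wav", sr=22050, duration=5.0)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Add channel and batch dimensions: [n_mels, time] -> [1, 1, n_mels, time]
x = torch.from_numpy(mel_db).unsqueeze(0).unsqueeze(0)

with torch.no_grad():
    pred = model(x).argmax(dim=1)  # e.g. 0 = no cough, 1 = cough
```

The two `unsqueeze` calls are all that is needed to turn a single spectrogram into a batch of size 1.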

Hi Sir, as of now I have used a PyTorch resnet32 network, and I trained my model on mel spectrogram datasets generated with librosa. I have written a program which continuously monitors for loudness above a certain decibel level (like above 70 dB for a cough or sneeze) with a microphone using pyaudio. If any sound occurs above that value, it triggers another program which records the audio for 5 seconds as a .wav file and passes it into resnet32 to detect whether it is a cough or sneeze or not. But the problem is that the whole process of recording and detecting takes 10 seconds. So I want to reduce the latency involved, and that's why I wanted to pass a live mel spectrogram into the resnet32 model, i.e. live batches from the mic converted to a mel spectrogram and fed into resnet32 to detect without any delay.
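
For reference, the trigger loop looks roughly like this (a simplified sketch of my setup; the chunk size and rate are assumptions, and the dB value is relative to the int16 full scale rather than a calibrated SPL measurement):

```python
import numpy as np
import pyaudio

RATE = 22050          # sampling rate (assumption)
CHUNK = 1024          # samples per buffer
THRESHOLD_DB = 70.0   # trigger level; relative, not calibrated SPL

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

def buffer_db(data):
    """RMS level of one buffer in dB (relative to the int16 range)."""
    samples = np.frombuffer(data, dtype=np.int16).astype(np.float32)
    rms = np.sqrt(np.mean(samples ** 2)) + 1e-9
    return 20.0 * np.log10(rms)

while True:
    if buffer_db(stream.read(CHUNK)) > THRESHOLD_DB:
        # Loud event detected: record the next 5 seconds
        frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * 5))]
        audio = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
        # ... compute the mel spectrogram of `audio` and pass it to the model
```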

I’m not sure if that would be possible, since you would be limited by the 5s recording time to calculate the mel cepstrum.
Given the frame of 5 seconds, you would calculate the (F)FT, apply the mel scale, take the log, and apply a DCT to get the cepstrum (the spectrum of a spectrum).
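
In librosa terms, that chain would look roughly like this (a sketch; `librosa.feature.mfcc` bundles the same steps into one call):

```python
import librosa
import scipy.fftpack

y, sr = librosa.load("clip.wav", sr=22050, duration=5.0)

# STFT -> power spectrogram -> mel filter bank
# (a longer n_fft gives more frequency bins but coarser time resolution)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)

# Log compression
log_mel = librosa.power_to_db(mel)

# DCT over the mel axis yields the cepstrum ("spectrum of a spectrum")
mfcc = scipy.fftpack.dct(log_mel, axis=0, type=2, norm="ortho")

# librosa wraps the same chain, keeping the first 20 coefficients:
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
```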

Note that the longer the time frame, the more frequency bins you’ll get, and vice versa.
If you cut the signal into smaller windows, the frequencies would become “blurred”.

OK sir, in that case is it possible to create a window which applies for 5 seconds to a continuously streaming mel spectrogram? To be clear: the mel spectrogram will be continuously generated and plotted for the sound coming from the mic, and every 5 seconds I could create a window to cut or extract the spectrogram and send it into my trained network. Of course I will lose the in-between 5 seconds of audio data, but that's OK as of now! Is this possible? Or would this cause more latency in my process? If that is the case, would my previous idea be better, sir?

Please bear with me if my questions sound wrong, sir, as I am a newbie to PyTorch!

Yes, that should generally be possible. Note, however, that the implementation would be out of scope for PyTorch and you would need to use your audio processing libraries for it.

Not necessarily. Once you have the initial 5s (and the minimal latency of the system), you could use a sliding window approach to create overlapping windows. This could create the feeling of a “real time” system for the user.

E.g. assuming that the window length is 5s and the model needs 1s to create the predictions, you could shift the window by 1s and produce the next output.
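
A rough sketch of that sliding window, reusing the streaming and spectrogram pieces from above (`read_audio`, `to_mel_db`, and `model` are placeholders for your stream reader, feature extraction, and trained network):

```python
import collections
import numpy as np
import torch

RATE = 22050     # sampling rate (assumption)
WINDOW_SEC = 5   # model input length
HOP_SEC = 1      # shift between consecutive predictions

# Ring buffer holding the most recent 5s of samples
ring = collections.deque(maxlen=RATE * WINDOW_SEC)

while True:
    # Placeholder: pull HOP_SEC seconds of samples off the pyaudio stream
    ring.extend(read_audio(seconds=HOP_SEC))
    if len(ring) == ring.maxlen:
        window = np.asarray(ring, dtype=np.float32)
        mel_db = to_mel_db(window)  # placeholder: melspectrogram + power_to_db
        x = torch.from_numpy(mel_db).unsqueeze(0).unsqueeze(0)
        with torch.no_grad():
            pred = model(x).argmax(dim=1)  # one prediction per 1s hop
```

Once the buffer has filled for the first time, every new second of audio yields a fresh prediction over the last 5s, which is what creates the “real time” feeling.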

@ptrblck Thank you so much sir, I will try that out for sure. These latency issues are on my laptop, but when I port the project to an RPi, I should be ready to expect even more latency! If that's the case, a Jetson Nano should be a good option as well. Thank you so much for your valuable time, sir!

Keep us updated on how the project works out!
It sounds fantastic! :slight_smile:

Thank you so much, Sir :blush: . I will be sure to update you and the forum on it.