Hi everyone, is it possible to use Kaldi Voice Activity Detection (VAD) in PyTorch?
I’m not deeply familiar with Kaldi, but how would you like to use it and what have you tried so far?
Are you stuck at a specific point?
I’m actually working on my thesis with audio data and I want to filter the non-speech frames out of every audio file, i.e. sequences where people do not speak. I have read that Kaldi VAD works well for this. Or is there another option in torchaudio to do it?
Unfortunately, I don’t know how Kaldi detects speech, or whether it’s a filtering algorithm or some kind of machine learning model. If Kaldi works, you could stick to it and preprocess the data that way. I’m unsure whether you would like to reimplement Kaldi’s algorithm in torchaudio or how the two should be combined.
I have implemented it this way:

import torchaudio

# Load the audio file and trim non-speech with torchaudio's VAD
waveform, sample_rate = torchaudio.load(file_path)
waveform = torchaudio.functional.vad(waveform, sample_rate)
and it seems to work, but before adding the VAD it took only 10–15 minutes to train an epoch, and now it needs almost 10 hours per epoch. Have I done something wrong?
That slowdown is plausible if the VAD itself takes a lot of time, since it currently runs on every sample during every epoch. It might be better to apply the VAD to all samples once before training, instead of running it inside the DataLoader.