Kaldi Voice Activity Detection (VAD)

Hi everyone, is it possible to use Kaldi Voice Activity Detection (VAD) in Pytorch?

I’m not deeply familiar with Kaldi, but how would you like to use it and what have you tried so far?
Are you stuck at a specific point?

I’m actually working on my thesis with audio data, and I want to filter the non-speech frames out of every audio file, i.e. the sequences where people do not speak. I have read that Kaldi’s VAD works well for this. Or is there another option in torchaudio to do it?

Unfortunately, I don’t know how Kaldi detects speech, i.e. whether it uses a filtering algorithm or some kind of machine learning model. If Kaldi works for you, you could stick to it and preprocess the data that way. I’m unsure whether you would like to reimplement Kaldi’s algorithm in torchaudio or how the two should be combined.

torchaudio has an implementation of VAD based on sox (see here), and another one implemented as an example here. Let us know how it goes :slight_smile:


I have implemented it this way:

import torchaudio

waveform, sample_rate = torchaudio.load(file_path)
# Trims the audio until voice activity is detected, i.e. removes leading silence
waveform = torchaudio.functional.vad(waveform, sample_rate)

and it seems to work, but before adding VAD an epoch took only 10-15 minutes to train, and now it needs almost 10 hours per epoch. Have I done something wrong?

That slowdown is plausible if the VAD itself is what takes so long.
It might be better to apply VAD to all samples once before training instead of running it inside the DataLoader.

@Kla – have you tried apply_effects_tensor with VAD effect? see here.