Kaldi Voice Activity Detection (VAD)

Kla · November 16, 2020, 6:19pm

Hi everyone, is it possible to use Kaldi Voice Activity Detection (VAD) in PyTorch?

ptrblck · November 18, 2020, 10:22am

I’m not deeply familiar with Kaldi, but how would you like to use it and what have you tried so far?
Are you stuck at a specific point?

Kla · November 18, 2020, 1:05pm

I’m working on my thesis with audio data and I want to filter the non-speech frames out of every recording, i.e. the segments where nobody is speaking. I have read that Kaldi VAD works well for this. Or is there another way to do it in torchaudio?

ptrblck · November 19, 2020, 7:43am

Unfortunately, I don’t know how Kaldi detects speech, or whether it uses a filtering algorithm or some kind of machine learning model. If Kaldi works for you, you could stick with it and preprocess the data that way. I’m unsure whether you would like to reimplement Kaldi’s algorithm in torchaudio or how the two should be combined.

vincentqb · November 19, 2020, 9:08pm

torchaudio has an implementation of VAD based on sox, see here, and another implemented as an example here. Let us know how your experience goes.

Kla · November 21, 2020, 2:22pm

I have implemented it this way:

import torchaudio

waveform, sample_rate = torchaudio.load(file_path)
waveform = torchaudio.functional.vad(waveform, sample_rate)

and it seems to work, but before adding VAD an epoch took only 10 to 15 minutes to train, and now it needs almost 10 hours per epoch. Have I done something wrong?

Alexuan · January 12, 2021, 10:16am

Hi!
This slowdown is plausible if the VAD step itself is expensive, since it is then repeated for every sample in every epoch.
It may be better to apply VAD to all samples once before training instead of running it inside the DataLoader.
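
A minimal sketch of that idea, assuming the recordings are .wav files in one folder. The helper names (trim_with_vad, preprocess_dataset) and the folder names are made up for illustration; only torchaudio.load, torchaudio.functional.vad and torchaudio.save are real torchaudio calls. Note that, as far as I know, this VAD only trims from the front of the recording, so it will not remove silence in the middle or at the end.

```python
from pathlib import Path


def trim_with_vad(src: Path, dst: Path) -> None:
    """Load one file, run torchaudio's VAD once, and save the result."""
    import torchaudio  # imported here so the caching helper below works without it

    waveform, sample_rate = torchaudio.load(str(src))
    trimmed = torchaudio.functional.vad(waveform, sample_rate)
    torchaudio.save(str(dst), trimmed, sample_rate)


def preprocess_dataset(src_dir, dst_dir, process=trim_with_vad):
    """Run `process` once per .wav file and skip files that already exist in dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(src_dir.glob("*.wav")):
        dst = dst_dir / src.name
        if not dst.exists():  # already processed in an earlier run
            process(src, dst)
        outputs.append(dst)
    return outputs


# Hypothetical usage, run once before training:
# preprocess_dataset("data/raw", "data/vad_trimmed")
```

The Dataset then only needs to call torchaudio.load on the files in the output folder, so the VAD cost is paid once rather than in every epoch.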

vincentqb · June 8, 2021, 10:12pm

@Kla – have you tried apply_effects_tensor with VAD effect? see here.
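
For reference, a rough sketch of what that could look like. It assumes a torchaudio build with sox support where torchaudio.sox_effects.apply_effects_tensor is available (this depends on the torchaudio version). The helper names vad_effects and apply_vad are only illustrative. Since the sox vad effect trims from the front only, the chain reverses the audio to trim the end as well.

```python
def vad_effects(trim_end: bool = True):
    """Build a sox effect chain: trim leading silence, optionally trailing too."""
    effects = [["vad"]]
    if trim_end:
        # vad only trims the front, so reverse, trim again, and reverse back
        effects += [["reverse"], ["vad"], ["reverse"]]
    return effects


def apply_vad(waveform, sample_rate, trim_end: bool = True):
    """Apply the chain to a (channels, time) tensor; returns (waveform, sample_rate)."""
    import torchaudio  # assumes a build that ships the sox effects backend

    return torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, vad_effects(trim_end)
    )


# Hypothetical usage:
# waveform, sample_rate = torchaudio.load(file_path)
# waveform, sample_rate = apply_vad(waveform, sample_rate)
```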