Adapting wav2vec2 to a stream of int16 audio

barakb · April 7, 2022, 8:09am

Hi, I’m trying to use wav2vec2 on an audio stream built in the following way (this is not the whole pipeline, just the important lines):

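import numpy as np
import pyaudio

# Open a 16 kHz, 16-bit input stream.
stream = pyaudio.PyAudio().open(
    format=pyaudio.paInt16,
    channels=self.CHANNELS,
    rate=16000,
    input=True,
    frames_per_buffer=self.INPUT_FRAMES_PER_BLOCK,
)
# Read one block of raw bytes and view it as int16 samples.
data = stream.read(self.INPUT_FRAMES_PER_BLOCK)
data = np.frombuffer(data, dtype=np.int16)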

It is worth mentioning that I have already run other deep learning models on this same streaming framework, and they work well.
I followed this tutorial.
The streaming and offline pipelines are the same; the only difference is that the offline one gets its audio from torchaudio.load, while the other gets it from the stream, which is in paInt16 format.
Before I get to the normalization, I’ll mention that I’m sure the weights load correctly; I even tested this with dummy inputs.
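Since the two paths differ only in where the audio comes from, one way to narrow this down is to compare basic statistics of the array that reaches the model in each case. The sketch below is only illustrative: summarize_waveform is a hypothetical helper name, and it assumes the input can be viewed as a NumPy array (for a CPU tensor returned by torchaudio.load, pass waveform.numpy()).

```python
import numpy as np


def summarize_waveform(x) -> dict:
    """Return dtype, shape, range, and RMS of an audio array (hypothetical helper)."""
    a = np.asarray(x)
    # Cast to float64 before squaring so int16 input cannot overflow.
    a64 = a.astype(np.float64)
    return {
        "dtype": str(a.dtype),
        "shape": tuple(a.shape),
        "min": float(a.min()),
        "max": float(a.max()),
        "rms": float(np.sqrt(np.mean(np.square(a64)))),
    }
```

If the offline input shows float32 values within roughly [-1, 1] and a shape of (channels, frames), while the streamed input shows a different dtype, range, or shape, that difference is a likely place to look.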

My goal is for it to work on the online stream. I would know it is working if I saw transcripts that make sense.
I saw that torchaudio normalizes using a factor of (1<<31).
Since I’m using a stream in paInt16 format, I tried using (1<<15).
That didn’t work, so I tried streaming in int32 format with the original factor, which also didn’t work.
I also tried all kinds of other normalizations, but nothing worked.
Is there anything else I’m missing to make the input stream match what wav2vec2 expects?
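For reference, the (1<<15) scaling described above can be sketched as follows. This is a minimal sketch, not the tutorial's code: the function name int16_bytes_to_float32 and the mono downmix by averaging are assumptions made for illustration, and it assumes the bytes hold interleaved int16 samples in native byte order, as PyAudio delivers them for paInt16.

```python
import numpy as np

INT16_SCALE = 1 << 15  # 32768, full-scale magnitude of a signed 16-bit sample


def int16_bytes_to_float32(raw: bytes, channels: int = 1) -> np.ndarray:
    """Convert interleaved int16 PCM bytes to a mono float32 array in [-1.0, 1.0)."""
    samples = np.frombuffer(raw, dtype=np.int16)
    if channels > 1:
        # Samples are interleaved per frame; average the channels down to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    # Cast before dividing so the result is float32, not float64.
    return samples.astype(np.float32) / INT16_SCALE
```

The result is a 1-D array of shape (frames,). torchaudio.load returns a tensor of shape (channels, frames), so the streamed array would probably also need a leading dimension (for example torch.from_numpy(x).unsqueeze(0)) before it matches the offline input.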

Thanks for your help!