Torchaudio feature extraction

I have been following the tutorial for feature extraction using torchaudio here: torchaudio.pipelines — Torchaudio 0.10.0 documentation

It says the result is a list of 12 tensors, where each entry is the output of a transformer layer. So the first tensor in the list has a shape like (1, 2341, 768).

This seems to be correct, as I get this result for most audio files.

However, for some videos I still get a list of length 12, but bizarrely the entries have a batch size greater than 1, so the shape is (2, 2341, 768). I am baffled as to why this is.

Any clues would be great.

I don’t know which code you are executing, but the linked example code seems to work for me:

import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE

model = bundle.get_model()

waveform = torch.randn(1, 1000)  # batch size 1
features, _ = model.extract_features(waveform)
for f in features:
    print(f.shape)
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])
# torch.Size([1, 2, 768])

waveform = torch.randn(2, 1000)  # batch size 2
features, _ = model.extract_features(waveform)
for f in features:
    print(f.shape)
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])
# torch.Size([2, 2, 768])

waveform = torch.randn(16, 1000)  # batch size 16
features, _ = model.extract_features(waveform)
for f in features:
    print(f.shape)
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])
# torch.Size([16, 2, 768])

Do you see the same behavior or does it differ?

@ptrblck thank you so much for your answer.
So, what you have shown is correct and the behaviour is the same for me.

The issue is that some of the audio turns out to have two channels (I cross-posted the question on Stack Overflow, and one of the replies helped me find this). I have about 1k videos; many of them have a single channel, but some have two.

So,

data_waveform, rate_of_sample = torchaudio.load(audio_data)
print(data_waveform.shape)  # (channels, samples)
sys.exit()

While for most of them I get a shape of (1, some_int), for some of them I get, for example, torch.Size([2, 3519168]). I am told this is because of mono vs. stereo.

The question for me now is: what is the best way to deal with this? I am very new to working with audio.

Currently, I compute the mean of the two channels. Not sure if this makes sense?
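
In code, the downmix is just this (a minimal sketch; data_waveform is the (2, num_samples) tensor returned by torchaudio.load above):

# Average across the channel dimension; keepdim=True preserves a
# (1, num_samples) shape, matching the mono files
mono_waveform = data_waveform.mean(dim=0, keepdim=True)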

It depends on your task and on the scenario of the stereo recording. If your task is ASR and the speech in the stereo recording is close to the microphones, it is okay to average the two channels.

If the speech is far away from the microphones and the time of arrival at the two microphones is different, I would recommend choosing only one channel for extracting features. You can verify this by listening to the audio: if the speaker sounds like they are coming from the left or right instead of from the front, the time of arrival differs between the channels.
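
In that case, keeping a single channel is just a slice (a minimal sketch; data_waveform is the (channels, samples) tensor from torchaudio.load):

# Slice with 0:1 rather than index 0 to keep the channel dimension,
# so the result has shape (1, num_samples) instead of (num_samples,)
mono_waveform = data_waveform[0:1, :]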