Torchaudio backend change on Windows

I’m trying to follow Patrick von Platen’s walkthrough on fine-tuning the XLS-R wav2vec 2.0 model. Everything runs and all outputs look correct in the provided Google Colab notebook, but I get an error when I download the notebook and run it locally in Jupyter on my Windows system.

The walkthrough references Mozilla’s Common Voice dataset and defines a function and mapping to prepare the dataset for training:

import torchaudio
import soundfile
torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = False

def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])
    
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names)

However, running the code above raises a runtime error: the mp3 decoder in datasets asks torchaudio for the sox_io backend (which is available in Google Colab), while only soundfile is available on my machine:


RuntimeError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\datasets\features\audio.py in _decode_mp3(self, path_or_file)
170 try:
--> 171 torchaudio.set_audio_backend("sox_io")
172 except RuntimeError as err:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\torchaudio\backend\utils.py in set_audio_backend(backend)
43 raise RuntimeError(
---> 44 f'Backend "{backend}" is not one of '
45 f'available backends: {list_audio_backends()}.')

RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].

The above exception was the direct cause of the following exception:

ImportError Traceback (most recent call last)
in
----> 1 common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names)
2 common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\datasets\arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
2118 new_fingerprint=new_fingerprint,
2119 disable_tqdm=disable_tqdm,
-> 2120 desc=desc,
2121 )
2122 else:

ImportError: To support decoding 'mp3' audio files, please install 'sox'.
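
For anyone trying to reproduce this, here is a minimal sketch of how the available backends can be checked outside the notebook. It relies on `list_audio_backends()`, the same call that appears in the traceback above. The helper name `available_audio_backends` is my own, and the fallback to an empty list when torchaudio is missing (or does not expose that function) is an assumption I added so the snippet runs anywhere.

```python
def available_audio_backends():
    """Return the backend names torchaudio reports, or [] if they cannot be queried."""
    try:
        import torchaudio
    except (ImportError, OSError):
        # torchaudio is not installed or failed to load its native libraries
        return []
    # list_audio_backends() is the call shown in the traceback; it may be
    # absent in other torchaudio versions, so look it up defensively
    lister = getattr(torchaudio, "list_audio_backends", None)
    if not callable(lister):
        return []
    return [str(name) for name in lister()]


backends = available_audio_backends()
print("available backends:", backends)
print("sox_io available:", "sox_io" in backends)
```

On my Windows machine the traceback suggests this would list only soundfile, which matches the error message.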

Note that I’m using the following packages:

datasets==1.18.3
transformers==4.11.3
torchaudio==0.10.0
jiwer==2.5.0
soundfile==0.9.0
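
The list above can be regenerated with a short script along these lines. This is only a sketch: it uses the standard library's `importlib.metadata`, and it assumes the installed distribution names match the names in my list, which may not hold for every package (for example, soundfile has also been distributed as PySoundFile). The helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as listed in this post; the distribution names are assumed to match
PACKAGES = ["datasets", "transformers", "torchaudio", "jiwer", "soundfile"]


def installed_versions(names):
    """Map each package name to its installed version, or None if it is not found."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```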

Also note that I have installed the latest version of PySoundFile (to my knowledge), downloaded sox version 14.4.2, and added its path to my environment variables.

Is there any way for me to tell torchaudio to use soundfile as backend rather than sox? Am I missing something fundamental?