I’m an intermediate PyTorch user (only did a vision project before) who wants to use torchaudio for something new.
I’m currently making a dataset loader for the NSynth dataset (16bit 16kHz PCM wav files) specific to my application.
As I’ll be training on an RTX GPU, and the data is originally 16bit, I was wondering if it would be smart to use float16 in this case. I’m not sure about this though, given my limited knowledge of signal processing. torchaudio.load returns 32bit floats by default and doesn’t offer an option to load float16, so I was wondering whether there is a theoretical reason for that.
I don’t think the data loading should be performed in FP16, as you might end up with some quantization noise.
If I’m not mistaken, the 16bit audio files represent 65536 different levels.
Since FP16 cannot represent all integers >2048 (Wikipedia - FP16), you’ll lose some information.
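A quick way to see this loss concretely (a sketch using NumPy rather than torchaudio) is to cast the full int16 range to float16 and count how many distinct levels survive:

```python
import numpy as np

# All 65536 possible 16-bit PCM sample values.
samples = np.arange(-32768, 32768, dtype=np.int64)

# Cast to float16, as a hypothetical float16 loader would.
as_fp16 = samples.astype(np.float16)
survivors = np.unique(as_fp16).size

# float16 has an 11-bit significand, so integers above 2048
# collapse onto the nearest representable value.
print(survivors)
assert survivors < 65536  # many of the original levels are gone
```

So a large fraction of the 65536 levels merge together, exactly the quantization noise mentioned above.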
That being said, once you’ve loaded and preprocessed the data, you could still use FP16 for the model training. Have a look at apex/amp for an automatic mixed-precision approach.
Actually, a 64bit float can preserve the original information without any loss. In my experience, the amount of information lost from using float32 is actually quite small.
FP32 is able to represent all integers in [-16777216, 16777216], which should thus work for these audio files. Why would you lose information in this use case, or did I misunderstand your explanation?
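To check this concretely (again a NumPy sketch, not torchaudio itself), you can round-trip every possible int16 value through float32 and confirm that nothing changes:

```python
import numpy as np

# Every possible 16-bit PCM sample value.
samples = np.arange(-32768, 32768, dtype=np.int64)

# int16 -> float32 -> int round-trip is exact: float32's 24-bit
# significand represents all integers up to 2**24 = 16777216.
round_trip = samples.astype(np.float32).astype(np.int64)
assert np.array_equal(samples, round_trip)

# float16, by contrast, does not survive the same round-trip.
assert not np.array_equal(samples, samples.astype(np.float16).astype(np.int64))
```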
I did some googling and found that indeed I was wrong. My original conclusion was based on the fact that when I used librosa to load wav files with the precision set to fp32, some values appeared rounded, but with fp64 the exact values were shown. I guess perhaps that was a display issue? I’m not sure what the mechanism behind that is.
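One plausible explanation (an assumption on my part, not verified against librosa internals): int16 samples are typically normalized to [-1, 1] by dividing by 32768, and since 32768 is a power of two, that division only changes the floating-point exponent and is therefore exact in both float32 and float64. Only the default printed precision differs, which can look like rounding. A small sketch:

```python
import numpy as np

sample = 12345  # an arbitrary int16 PCM value

# Dividing by a power of two is exact in binary floating point,
# so the normalized value is identical in float32 and float64.
f32 = np.float32(sample) / np.float32(32768)
f64 = np.float64(sample) / np.float64(32768)
assert np.float64(f32) == f64  # same value, despite different printouts

# repr shows fewer digits for float32 by default, which can look
# like rounding even though no information was lost.
print(repr(f32), repr(f64))
```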
Thanks! Makes a lot of sense. I think I’ll stay away from apex for now, I don’t want to overcomplicate things. I guess I can always start with FP32 and move to FP16 later and compare.