Hi, I’m new to audio signal processing and to PyTorch, and I’m having some trouble understanding this part of the docs for the torchaudio load function:

normalization (bool, number, or callable, optional) – If boolean True, then output is divided by 1 << 31 (assumes signed 32-bit audio), and normalizes to [-1, 1]. If number, then output is divided by that number. If callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)
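The three modes described in the quoted docs can be mimicked in plain Python. This is only a sketch of the documented semantics, not the actual loader code, and the sample values below are made up for illustration:

```python
# Fake signed 32-bit samples standing in for the loaded output tensor.
raw = [2147483647, -1073741824, 0]

# normalization=True: divide by 1 << 31.
out_true = [x / (1 << 31) for x in raw]

# normalization=16 (or any number): divide by that number.
out_num = [x / 16 for x in raw]

# normalization=<callable>: the output is passed to the callable,
# and the output is divided by whatever the callable returns.
peak = lambda xs: max(abs(x) for x in xs)
out_call = [x / peak(raw) for x in raw]

print(out_true)   # values in [-1, 1)
print(out_call)   # peak-normalized: max magnitude is exactly 1.0
```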

From what I understand, the function assumes the file has a bit depth of 32 bits; however, that bit depth is rather rare. Does 32-bit audio indeed mean bit depth, or something else?

Also, I don’t understand the meaning of “output is divided by 1 << 31”. What is meant by “output”, and what is meant by “1 << 31”?

I also assume 32-bit audio corresponds to the bits per sample.

1 << 31 is a left shift by 31 positions, so it translates to 1 << 31 == 2**31 == 2147483648, which would be the maximum magnitude of each sample.
If I’m not mistaken, 32-bit audio would have the range [−2,147,483,648, 2,147,483,647], so you would only get a minimal error for the maximum positive value.
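To make the arithmetic concrete, here is a quick check of the shift and of the tiny error at the positive end (plain Python, no torchaudio needed):

```python
# 1 << 31 shifts the bit pattern 1 left by 31 positions, i.e. 2**31.
assert 1 << 31 == 2**31 == 2147483648

int32_min = -(1 << 31)        # -2147483648
int32_max = (1 << 31) - 1     #  2147483647

# Dividing by 1 << 31 maps the signed 32-bit range onto [-1, 1):
print(int32_min / (1 << 31))  # -1.0
print(int32_max / (1 << 31))  # 0.9999999995343387 -- the minimal error at the top
```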

Indeed, if I normalize with normalization=True I get a tensor with max and min values within [-1, 1]:
max: 0.0881
min: -0.1289

while if I use normalization=16:
max: 11821056
min: -17301504

and for normalization=False:
max: 1.8914e+08
min: -2.7682e+08

Indeed, the std and mean of the data loaded using normalization=True are between 0 and 1, so it seems like the correct normalization for the data I’m working with.
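The three sets of numbers above are consistent with each other: dividing the raw (normalization=False) extremes by 1 << 31 or by 16 reproduces the other two results. A quick check, using the reported minimum reconstructed to full precision (the exact digits are an assumption, since only -2.7682e+08 was printed):

```python
raw_min = -276824064           # ~ the -2.7682e+08 reported for normalization=False

print(raw_min / (1 << 31))     # -0.12890625, matching the normalization=True min (-0.1289)
print(raw_min / 16)            # -17301504.0, matching the normalization=16 min
```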