RuntimeError: arange_out not supported on CPUType for Half

Hello,

While browsing for half-precision or mixed-precision training in PyTorch, I found these tools https://nvidia.github.io/apex/index.html, and particularly apex.amp, which should allow for a rather straightforward switch between different floating-point precisions for training.

However, when I switch existing FP32 code to mixed precision or FP16, I get the following:
RuntimeError: arange_out not supported on CPUType for Half
which seems to be caused by torch.stft, which I use for computing spectral reconstruction losses.

… Is there any way to use stft with FP16, please? (I use PyTorch 1.1.0 and CUDA 9.1.85)

Thanks! And if anyone can recommend other tools/resources for mixed-precision training in PyTorch, I would be very thankful for more references, as I did not get good results with apex.amp.

Some methods are not implemented for CPU tensors in FP16.
Could you post a code snippet showing how you’ve used amp and where this issue is triggered?
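For example, the op named in your error message (arange) seems to be one of them; something like this should raise a similar error (a quick sketch, not tested on your exact setup):

import torch

# arange is not implemented for half precision on the CPU in this PyTorch version,
# so this should fail with an error along the lines of
# "arange_out not supported on CPUType for Half"
torch.arange(10, dtype=torch.float16, device='cpu')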

Also, could you elaborate a bit more about your results and what went wrong using amp?

Thank you @ptrblck for answering my topic and also picking up the issue I posted on GitHub! I put more details regarding amp in the GitHub issue, so I will try to give you more details on this thread.

About the error when calling the stft operation: since you mentioned the difference between CPU and GPU tensors, I manually checked in ipython, and indeed calling stft on an FP16 CPU tensor raises an error, which is not the case for the same tensor sent to a CUDA device.
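Roughly what I checked (a sketch, with made-up sizes just for illustration):

import torch

x = torch.randn(1, 4096).half()       # FP16 tensor on the CPU
w = torch.hann_window(1024).half()    # window built in FP32, then cast to FP16
torch.stft(x, 1024, hop_length=256, window=w, center=False)
# -> raises a RuntimeError on the CPU

torch.stft(x.cuda(), 1024, hop_length=256, window=w.cuda(), center=False)
# runs without error on the CUDA device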

However, my code always runs on the GPU. The line that causes the error in the loss calculation is

mag_grains = torch.norm(torch.stft(mb_grains, n_fft, hop_length=hop_size, win_length=win_size, window=torch.hann_window(win_size).to(device, non_blocking=True), center=False), dim=3)

mb_grains.type() == 'torch.cuda.HalfTensor' and device == device(type='cuda', index=1), so everything should be in GPU memory … I don't understand what CPU data is causing the error here; do you see it?

About amp, are you okay with continuing the discussion in the GitHub issue?

PS: earlier in the code I put torch.set_default_dtype(torch.float16) to make sure that the window is created with the right dtype (as it is not automatically cast by amp in that case, if I understood correctly).
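If I understood the docs correctly, factory functions created without an explicit dtype pick up this global default, e.g. (a small sketch):

import torch

torch.set_default_dtype(torch.float16)
print(torch.get_default_dtype())   # torch.float16
print(torch.ones(3).dtype)         # torch.float16: tensors created without an explicit
                                   # dtype now come out in half precision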

Sure! We can debug amp-specific problems in the issue.

Thanks for the information!
It looks like this problem is not amp-specific, but might be caused by setting the default dtype to torch.float16.
E.g. this code should throw the same error:

import torch

torch.set_default_dtype(torch.float16)

device = 'cuda'
input = torch.randn(100, device=device)
n_fft = 100
hop_size = 10
win_size = 10

# hann_window is created on the CPU with the (now) float16 default dtype,
# which hits the unsupported op
res = torch.stft(input, n_fft, hop_length=hop_size, win_length=win_size,
                 window=torch.hann_window(win_size).to(device, non_blocking=True),
                 center=False)
print(res)

Setting the default dtype to float16 will most likely cause other errors as well, e.g.:

input = torch.randn(100).to(device)

will also cause an error:

RuntimeError: th_normal not supported on CPUType for Half

Thanks @ptrblck, I will answer on the other issue about setting up amp correctly to speed up training.

Regarding PyTorch, I removed the set_default_dtype call and instead pass a dtype argument to the functions that process data but are not automatically cast by amp at the input of the model forward.

I leave dtype=None when running at opt_level O0 (or when running my original FP32 code); then the data is not manually cast and is created with the implicit FP32 default type all the way through.

For the other opt_levels, I set dtype=torch.float16, and inside the model functions I cast data either:
- at creation time, with the argument dtype=dtype, e.g. for the window in the stft function,
- or with .type(dtype), e.g. when I sample the model prior self.prior_distrib = distrib.Normal(torch.zeros(z_dim), torch.ones(z_dim)); it seems calling the sample function does not automatically cast the prior batch to FP16, so I do that manually before computing the regularization (a sketch follows after this list).
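A rough sketch of that second case (the variable names and sizes are just illustrative):

import torch
import torch.distributions as distrib

z_dim, batch_size = 64, 16           # illustrative sizes
dtype = torch.float16                # set to None when running at O0 / plain FP32

# the prior parameters are FP32, so sample() returns FP32 batches
prior_distrib = distrib.Normal(torch.zeros(z_dim), torch.ones(z_dim))
z_prior = prior_distrib.sample((batch_size,))
if dtype is not None:
    z_prior = z_prior.type(dtype)    # manual cast before computing the regularization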

Is this a better way to go for having a single codebase that can be run either in FP32, mixed precision, or FP16?

Now about the runtime error:
mag_grains = torch.norm(torch.stft(mb_grains, n_fft, hop_length=hop_size, win_length=win_size, window=torch.hann_window(win_size, dtype=dtype).to(device), center=False), dim=3)
In this case, the spectral reconstruction error function is called with dtype=torch.float16, so accordingly the window is created with the manual cast argument dtype=dtype.

I get RuntimeError: arange_out not supported on CPUType for Half; inspecting with ipython, I have mb_grains.type() == 'torch.cuda.HalfTensor', and when calling torch.hann_window(win_size, dtype=dtype).to(device).type(), instead of also getting 'torch.cuda.HalfTensor' (which I expected), I got the same RuntimeError …

This pointed me to modify the line throwing the error to
mag_grains = torch.norm(torch.stft(mb_grains, n_fft, hop_length=hop_size, win_length=win_size, window=torch.hann_window(win_size).to(device).type(dtype), center=False), dim=3)
which now runs without error.
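For reference, the two variants side by side (a sketch, with illustrative sizes):

import torch

device = 'cuda'
dtype = torch.float16
win_size = 1024

# variant 1: asking for FP16 at creation time builds the window on the CPU in half
# precision, and hann_window apparently relies on arange internally, hence
# "arange_out not supported on CPUType for Half"
# window = torch.hann_window(win_size, dtype=dtype).to(device)

# variant 2: build the window in FP32, move it to the GPU, then cast
window = torch.hann_window(win_size).to(device).type(dtype)
print(window.type())   # 'torch.cuda.HalfTensor'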

Is that expected behavior?

Now that the computation seems to run all the way in FP16, I will look (with you on the other issue, if you're up for it) at possible speed gains with amp.

Thanks!