Torchvision model vit_b_16 fails to train with AMP

I am trying to use the vit_b_16 torchvision model with AMP (torch.cuda.amp / torch.autocast):
https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html

I am encountering the error:

Traceback (most recent call last):
  File "/home/phil/Code/ReLish/benchmark/train_cls.py", line 95, in main
    train_model(C, train_loader, valid_loader, model, output_layer, criterion, optimizer, scheduler)
  File "/home/phil/Code/ReLish/benchmark/train_cls.py", line 455, in train_model
    output = model(data)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torchvision/models/vision_transformer.py", line 298, in forward
    x = self.encoder(x)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torchvision/models/vision_transformer.py", line 157, in forward
    return self.ln(self.layers(self.dropout(input)))
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torchvision/models/vision_transformer.py", line 113, in forward
    x, _ = self.self_attention(query=x, key=x, value=x, need_weights=False)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/phil/anaconda3/envs/relish/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 1113, in forward
    return torch._native_multi_head_attention(
RuntimeError: expected scalar type Half but found Float

The error occurs in the first epoch, after the training phase completes, right when validation should start. Without AMP, no error occurs, and every other model I’ve tried (including Swin transformers) works fine.

I have found this, but am not sure how it would apply:

Maybe it’s something similar to this?

This is the code I’m using:

For reference I call it as (Python 3.9, PyTorch 1.12.1, CUDA 11.6, cuDNN 8.3.2):

./train_cls.py --act_func=original --batch_size=32 --dataset=Imagenette --epochs=120 --model=vit_b_16

Adding --no_amp to the command line turns AMP off.
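For reference, the AMP path follows the standard autocast/GradScaler pattern; a simplified sketch of the per-epoch training step (the real train_cls.py does more, and run_epoch is a made-up name here):

```python
import torch

def run_epoch(model, loader, criterion, optimizer, device="cuda", use_amp=True):
    # Simplified sketch (assumption): the real script also drives the
    # scheduler, metrics and the validation pass; the AMP usage is standard.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp and device == "cuda")
    model.train()
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        with torch.autocast(device_type=device, enabled=use_amp):
            output = model(data)
            loss = criterion(output, target)
        scaler.scale(loss).backward()  # no-op scaling when AMP is disabled
        scaler.step(optimizer)
        scaler.update()
```

With --no_amp, use_amp is False and both the autocast region and the GradScaler become no-ops, so the loop degenerates to a plain FP32 training step.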

Can anyone help?


The error seems to be raised by the MHA layer. Could you check if the latest nightly binary still raises the error, please?

No, the error is gone under pytorch-nightly/linux-64::pytorch-1.14.0.dev20221011-py3.9_cuda11.6_cudnn8.3.2_0. What is/was the problem, and can I do anything to overcome the issue now, besides moving to a nightly PyTorch release? (The nightly broke other, unrelated parts of my code that use torch.jit.script; I had to comment them out, otherwise it would crash before I had the chance to train.)
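One possible stopgap for the torch.jit.script breakage, instead of commenting the calls out, might be a fallback wrapper like this (maybe_script is a hypothetical helper, untested against the actual nightly failure):

```python
import torch

def maybe_script(fn):
    # Hypothetical fallback: if scripting fails on the current build
    # (e.g. a nightly regression), keep the eager-mode function instead.
    try:
        return torch.jit.script(fn)
    except Exception:
        return fn

def double(x: torch.Tensor) -> torch.Tensor:
    return x * 2

double = maybe_script(double)  # scripted where possible, eager otherwise
```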

I don’t know what the fix was, but you could check the commits, e.g. via git blame, to see which issues were fixed and whether any of them matches your error.