Converting model to torch.float16

I converted my 3D training data to float16 to reduce memory usage, but now I get the following error:
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same
How can I convert the model to torch.float16 (or torch.half) before I start training?

The full error stack looks like this:

pred, info = model.update(imgs, gt, dataset, learning_rate, training=True)
  File "/home/hamid/Desktop/OpticalFlow/FlowSciVis/Flow-3D/model/RIFE.py", line 126, in update
    flow, mask, merged, flow_teacher, merged_teacher, loss_distill = self.flownet(torch.cat((imgs, gt), 1), scale=[1, 1, 1])
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/Desktop/OpticalFlow/FlowSciVis/Flow-3D/model/IFNet.py", line 177, in forward
    flow, mask = stu[i](torch.cat((img0, img1), 1), None, scale=scale[i]) # stu[0]
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/Desktop/OpticalFlow/FlowSciVis/Flow-3D/model/IFNet.py", line 94, in forward
    x = self.conv0(x)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hamid/miniconda3/envs/gpu/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 567, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

If you want to use “pure” float16 training, you would have to call model.half() to transform all parameters and buffers to float16, too.
We generally recommend using torch.cuda.amp for mixed-precision training, as it will be more stable than pure float16 training.
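
For completeness, both options look roughly like this (a minimal sketch; model, data, target, loader, optimizer, and criterion are placeholders for your actual objects, not code from the project above):

import torch

# Option 1: "pure" float16 -- parameters, buffers, and inputs all in half precision
model = model.half()           # converts all parameters and buffers to torch.float16
output = model(data.half())    # inputs must have the same dtype as the weights

# Option 2 (recommended): mixed-precision training via torch.cuda.amp
scaler = torch.cuda.amp.GradScaler()
for data, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)              # autocast picks float16/float32 per op
        loss = criterion(output, target)
    scaler.scale(loss).backward()         # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)                # unscales gradients; skips the step on inf/nan
    scaler.update()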


Thanks @ptrblck, that solved the previous issue. Now I am starting the training with torch.cuda.amp.autocast(True). I noticed, however, that each iteration now takes more time; could that be related to the autocast mode? I am using an NVIDIA TITAN V with CUDA version 11.2.
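
For reference, I measure the iteration time roughly like this (a sketch; without the synchronize calls the timings of asynchronous CUDA ops would be misleading):

import time
import torch

torch.cuda.synchronize()   # wait for pending GPU work before starting the timer
start = time.perf_counter()
pred, info = model.update(imgs, gt, dataset, learning_rate, training=True)
torch.cuda.synchronize()   # wait until the step has actually finished on the GPU
print(f"iteration time: {time.perf_counter() - start:.3f}s")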

This should not be the case. Could you post your model definition as well as the input shapes here, please?

I checked again and found that it is actually not the time per iteration that increased, but the time between epochs: the start of each new epoch takes increasingly long, which eventually slows down the whole training. I am not sure why this happens. I have 3D volumetric data of size (128, 128, 128) with 4 channels (density and velocities), and I am currently using a batch size of 45 after enabling the autocast mode. My model consists of 3 blocks with several convolutional layers each and PReLU activations. I recently added batch normalization after the activations to stabilize training, but now the loss increases early in training and becomes NaN; I'll need to find out why.
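
One suspicion for the slowdown between epochs is that the DataLoader workers are shut down and re-created at every epoch start. I'll try keeping them alive; a sketch of what I'd change (num_workers is a placeholder, and persistent_workers requires PyTorch >= 1.7):

train_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=45,
    shuffle=True,
    num_workers=4,            # placeholder; tune to the machine
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
)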