PyTorch loss functions require a long tensor as the target. Since I am using an RTX card, I am trying to train in float16 precision; furthermore, my dataset is natively float16. For training, my network needs a loss built from several terms; the code I use is the following:
loss = self.loss_func(F.log_softmax(y, 1), yb.long())
loss1 = self.loss_func(F.log_softmax(y1, 1), F.max_pool2d(yb, kernel_size=2, stride=2, padding=0).long())
loss2 = self.loss_func(F.log_softmax(y2, 1), F.max_pool2d(yb, kernel_size=4, stride=4, padding=0).long())
loss3 = self.loss_func(F.log_softmax(y3, 1), F.max_pool2d(yb, kernel_size=8, stride=8, padding=0).long())
loss4 = self.loss_func(F.log_softmax(y4, 1), F.max_pool2d(yb, kernel_size=16, stride=16, padding=0).long())
avg_loss = (loss + (0.9*loss1) + (0.8*loss2) + (0.7*loss3) + (0.6*loss4))/5
I always have to cast the target tensor from float16 to long, and this takes a huge amount of time. My GPU sits at only 80% utilization, and I see almost no benefit from switching from float32 to float16.
As a performance test I used only the first loss (with the other four commented out): GPU utilization rises to 97% and float16 gives a clear speed-up. With all the losses, one epoch takes 7 minutes in float32; with only the first loss it takes 5 minutes in float32 (3.5 minutes in float16).
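Concretely, for that test the loss reduced to this (the other terms commented out, everything else unchanged):

# Performance test: keep only the full-resolution term
loss = self.loss_func(F.log_softmax(y, 1), yb.long())
# loss1 = ...  loss2 = ...  loss3 = ...  loss4 = ...  (commented out)
avg_loss = loss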
Furthermore, if I profile my code, about 20 percent of the time is spent in the "to" function (which I assume is the casting). If I remove those lines, it drops to 2%.
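This is roughly how I profiled it (a minimal sketch; model, loss_func, xb and yb stand in for my real training objects, and the extra loss terms are omitted):

import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Profile one forward/backward pass; the casts show up under aten::to
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = model(xb)
    loss = loss_func(F.log_softmax(y, 1), yb.long())
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))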
Clearly I need all the losses for my network to converge, but is there any way to compute them while keeping the targets in float16 or float32, without casting to long?
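To illustrate the direction I am after, here is a rough, unvalidated sketch where yb is cast once per batch and the lower-resolution targets are taken by strided slicing on the long tensor, instead of max-pooling the float16 tensor and casting each result (I am not sure the slicing is equivalent to max pooling for my label maps):

# Hypothetical rework, not validated: one cast instead of five
yb_long = yb.long()                   # single full-resolution cast
targets = [yb_long,
           yb_long[..., ::2, ::2],    # stands in for the kernel_size=2 pooling
           yb_long[..., ::4, ::4],
           yb_long[..., ::8, ::8],
           yb_long[..., ::16, ::16]]
outputs = [y, y1, y2, y3, y4]
weights = [1.0, 0.9, 0.8, 0.7, 0.6]
avg_loss = sum(w * self.loss_func(F.log_softmax(out, 1), t)
               for w, out, t in zip(weights, outputs, targets)) / 5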
This cast looks like a major bottleneck and a big waste of GPU time.