Problem with long vs. float targets for the loss function

PyTorch loss functions require a long tensor as the target. Since I am using an RTX card I am trying to train in float16 precision, and my dataset is natively float16. For training, my network needs a loss made up of several terms computed at different scales; the code I use is the following:

loss = self.loss_func(F.log_softmax(y, 1), yb.long())

loss1 = self.loss_func(F.log_softmax(y1, 1),
                       F.max_pool2d(yb, kernel_size=2, stride=2,
                                    padding=0).long())
loss2 = self.loss_func(F.log_softmax(y2, 1),
                       F.max_pool2d(yb, kernel_size=4, stride=4,
                                    padding=0).long())
loss3 = self.loss_func(F.log_softmax(y3, 1),
                       F.max_pool2d(yb, kernel_size=8, stride=8,
                                    padding=0).long())
loss4 = self.loss_func(F.log_softmax(y4, 1),
                       F.max_pool2d(yb, kernel_size=16, stride=16,
                                    padding=0).long())

avg_loss = (loss + (0.9 * loss1) + (0.8 * loss2) + (0.7 * loss3) +
            (0.6 * loss4)) / 5

For every term I have to cast the target tensor from float16 to long, and this takes a huge amount of time. My GPU is only used at about 80% and I get almost no benefit from switching from float32 to float16.

I ran a test using only the first loss (with the other four commented out), purely to check performance: the GPU is used at 97% and benefits a lot from float16. With all the losses one epoch takes 7 minutes in float32; with only the first one it takes 5 minutes in float32 (and 3 and a half in float16).

Furthermore, if I profile my code, about 20 percent of the time is spent in the `to` function (which I guess is the casting). If I remove those lines it drops to 2%.
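To show what I mean, this is roughly the kind of profiler run I did (a sketch with dummy data; the tensor shape and the loop body are just for illustration, not my actual training step):

import torch
import torch.nn.functional as F

# dummy float16 targets on the GPU, roughly the shape of one of my batches
yb = torch.rand(4, 2560, 256, device="cuda", dtype=torch.float16)

# profile the casts and the pooling in isolation
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        _ = yb.long()
        _ = F.max_pool2d(yb, kernel_size=2, stride=2, padding=0).long()
print(prof.key_averages().table(sort_by="self_cpu_time_total"))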

Clearly I need all the losses for my network to converge, but is there any way to compute them with float16 or float32 targets, without casting to long?
It seems to me a big bottleneck and a big waste of GPU time.

That depends: which loss function are you using?

You are right, I am using nn.NLLLoss.

Well, the second argument is supposed to contain the indices of the labels, so it cannot really be a floating-point type.
Long is the only dtype we have that guarantees you will be able to index any tensor.
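A minimal illustration of that constraint (a sketch with made-up shapes; two classes, per-pixel targets):

import torch
import torch.nn.functional as F

log_probs = F.log_softmax(torch.randn(2, 2, 8, 8), dim=1)  # (N, C, H, W), two classes
target = torch.randint(0, 2, (2, 8, 8))                    # int64 class indices per pixel

loss = F.nll_loss(log_probs, target)      # fine: target is long
# F.nll_loss(log_probs, target.half())    # error: NLLLoss expects a Long target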

Note that these labels should be fixed, so you could precompute the poolings for all of them and just load the precomputed versions, no?
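For example, something along these lines could be run once offline (a sketch; the mask layout and how you save the result are assumptions):

import torch
import torch.nn.functional as F

def precompute_pooled_targets(mask):
    # mask: one float mask of shape (H, W); returns the full-resolution target
    # plus the four pooled versions, already cast to long
    m = mask.float().unsqueeze(0).unsqueeze(0)              # (1, 1, H, W)
    targets = [mask.long()]
    for k in (2, 4, 8, 16):
        pooled = F.max_pool2d(m, kernel_size=k, stride=k, padding=0)
        targets.append(pooled.squeeze(0).squeeze(0).long())
    return targets

# run once per mask and save the result (e.g. with torch.save), so the training
# loop only loads ready-made long tensors and never casts
targets = precompute_pooled_targets(torch.rand(2560, 256).round())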

I understand. Would I benefit from using binary cross entropy? I only have two classes. In that case would I be restricted to long too?
The precompute idea is good; the only problem is that my dataset already takes 150 GB of space…

BCE loss might help; I am not sure whether it has the same constraint on the target.
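For what it's worth, `nn.BCEWithLogitsLoss` accepts a float target of the same shape as the input, so no cast to long would be needed there (a sketch; the single-channel output layout is an assumption about the model):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 1, 256, 256)                      # raw single-channel model output
target = torch.randint(0, 2, (4, 1, 256, 256)).float()    # float 0/1 mask, same shape

loss = criterion(logits, target)                          # no .long() cast needed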

Well, I don't know how big your images are, but labels don't usually take much space: a label is a single number, and this precompute would turn it into 5 numbers. A 250x250 image already has almost 190,000 numbers, so these extra 4 should not change the size of your dataset much if the features are big enough.

No, actually it's a semantic segmentation network, so the "labels" are per pixel. There are only two possible classes for each pixel. Each image is 2560x256.
All my doubts started because I've just changed GPU: now I have an RTX card and I was trying to use FP16 training, which should enable the tensor cores (with Apex), but I see absolutely no benefit (it is actually almost slower…). I was trying to understand whether this part, which is constrained not to use FP16, may be the problem.
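For completeness, the Apex setup I'm referring to is along these lines (a sketch with a toy model; the model, optimizer, shapes and opt_level are placeholders, not my actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F
from apex import amp

model = nn.Conv2d(1, 2, kernel_size=3, padding=1).cuda()   # toy stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# O1 keeps float32 master weights and runs whitelisted ops in float16
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(2, 1, 64, 64, device="cuda")
target = torch.randint(0, 2, (2, 64, 64), device="cuda")

loss = F.cross_entropy(model(x), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:        # scaled backward pass
    scaled_loss.backward()
optimizer.step()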