Tackling Low GPU Kernel Occupancy During Loss Function Computation

Hello! We’re currently encountering a problem. Our loss function takes a long time to compute, yet GPU kernel occupancy is extremely low. We tried running each loss function in a separate process, but autograd cannot propagate gradients across process boundaries. Is there a way to parallelize the loss computation? For example, could we detach the relevant tensors during the forward pass and manually propagate the gradients during the backward pass?
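
To clarify what we mean, here is a minimal sketch of the detach-and-stitch idea. The names (`output`, `loss_fn`) are placeholders, not our real code: the graph is cut at the detached tensor, the loss is backpropagated only to that cut, and the resulting gradient is fed back into the original graph by hand.

```python
import torch

# Placeholder for a real model output, e.g. output = model(x).
output = torch.randn(8, 16, requires_grad=True)

def loss_fn(y):
    # Placeholder loss for illustration.
    return (y ** 2).mean()

# Cut the autograd graph: `detached` is a new leaf, so the loss's
# backward pass stops here instead of flowing into the model.
detached = output.detach().requires_grad_(True)

loss = loss_fn(detached)
loss.backward()  # populates detached.grad, nothing upstream

# Manually stitch the gradient back into the original graph.
output.backward(gradient=detached.grad)
```

In principle the `loss_fn(detached)` / `loss.backward()` step could then run in a worker, with only `detached.grad` shipped back to the main process for the final `output.backward(...)` call.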

Would multithreading work instead for your case?
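
Autograd graph construction is thread-safe, so independent loss terms can be built from separate Python threads and reduced into a single loss before one backward call. A rough sketch, with hypothetical loss terms (on GPU you would additionally need separate CUDA streams for the kernels to actually overlap):

```python
import torch
import torch.nn.functional as F
from concurrent.futures import ThreadPoolExecutor

# Placeholder tensors standing in for model output and labels.
pred = torch.randn(8, 16, requires_grad=True)
target = torch.randn(8, 16)

# Hypothetical independent loss terms, for illustration only.
loss_terms = [
    lambda: F.mse_loss(pred, target),
    lambda: F.l1_loss(pred, target),
    lambda: pred.abs().mean(),
]

# Build each term's subgraph in its own thread, then reduce.
with ThreadPoolExecutor(max_workers=len(loss_terms)) as pool:
    losses = list(pool.map(lambda f: f(), loss_terms))

total = torch.stack(losses).sum()
total.backward()  # a single backward covers all terms
```

Since everything stays in one process, autograd handles the backward pass normally, with no manual gradient plumbing needed.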