Tackling Low GPU Kernel Occupancy During Loss Function Computation

Hello! We’re currently encountering a problem. Our loss function takes a long time to compute, yet GPU kernel occupancy is extremely low. We tried running each loss function in a separate process, but autograd cannot propagate gradients across process boundaries. Is there a way to parallelize the loss computation? For example, could we detach the relevant tensors during the forward pass and manually propagate the gradients during the backward pass?
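
To clarify what we mean, here is a minimal sketch of the detach-and-stitch idea. The names (`output`, `loss_fn`) are placeholders, not our real code: the graph is cut at the detached tensor, the loss is backpropagated only to that cut, and the resulting gradient is fed back into the original graph by hand.

```python
import torch

# Placeholder for a real model output, e.g. output = model(x).
output = torch.randn(8, 16, requires_grad=True)

def loss_fn(y):
    # Placeholder loss for illustration.
    return (y ** 2).mean()

# Cut the autograd graph: `detached` is a new leaf, so the loss's
# backward pass stops here instead of flowing into the model.
detached = output.detach().requires_grad_(True)

loss = loss_fn(detached)
loss.backward()  # populates detached.grad, nothing upstream

# Manually stitch the gradient back into the original graph.
output.backward(gradient=detached.grad)
```

In principle the `loss_fn(detached)` / `loss.backward()` step could then run in a worker, with only `detached.grad` shipped back to the main process for the final `output.backward(...)` call.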

Would multithreading work instead for your case?
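
Autograd graph construction is thread-safe, so independent loss terms can be built from separate Python threads and reduced into a single loss before one backward call. A rough sketch, with hypothetical loss terms (on GPU you would additionally need separate CUDA streams for the kernels to actually overlap):

```python
import torch
import torch.nn.functional as F
from concurrent.futures import ThreadPoolExecutor

# Placeholder tensors standing in for model output and labels.
pred = torch.randn(8, 16, requires_grad=True)
target = torch.randn(8, 16)

# Hypothetical independent loss terms, for illustration only.
loss_terms = [
    lambda: F.mse_loss(pred, target),
    lambda: F.l1_loss(pred, target),
    lambda: pred.abs().mean(),
]

# Build each term's subgraph in its own thread, then reduce.
with ThreadPoolExecutor(max_workers=len(loss_terms)) as pool:
    losses = list(pool.map(lambda f: f(), loss_terms))

total = torch.stack(losses).sum()
total.backward()  # a single backward covers all terms
```

Since everything stays in one process, autograd handles the backward pass normally, with no manual gradient plumbing needed.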