Is loss calculation typically done on GPU or on CPU?

I can’t find the post anymore, but I read on these forums that it is better (or at least faster) to calculate the loss on the CPU. I wonder whether that is true. What are the pros and cons of calculating the loss on the GPU vs. the CPU? My particular concerns are speed and memory consumption. My assumption is also that if there is a notable difference, it is magnified with larger batches.
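
For concreteness, the two variants I’m asking about would look roughly like this (a minimal sketch; `model`, `criterion`, and the tensor shapes are placeholders, not code from the post I’m remembering):

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(512, 10).to(device)   # placeholder model
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(256, 512, device=device)
target = torch.randint(0, 10, (256,), device=device)

# Variant A: loss computed on the GPU, alongside the forward pass.
loss_gpu = criterion(model(inputs), target)

# Variant B: outputs moved to the CPU first, loss computed there.
loss_cpu = criterion(model(inputs).cpu(), target.cpu())
```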

People often measure CUDA times incorrectly (in this case it is even more likely, as loss.item() is often a CUDA synchronization point). In practice, loss calculation times should be too insignificant to be concerned about. Same with memory.
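
To illustrate the measurement pitfall (a minimal sketch; the shapes and the use of F.cross_entropy are arbitrary choices, not anything from the original post): CUDA kernels launch asynchronously, so a host-side timer is only meaningful if you synchronize before reading it. A timer placed right after loss.item() instead attributes all previously queued GPU work to the loss, which makes the loss look far more expensive than it is.

```python
import time
import torch
import torch.nn.functional as F

device = "cuda"
logits = torch.randn(4096, 1000, device=device)
target = torch.randint(0, 1000, (4096,), device=device)

# Naive timing: the kernel may still be running (or not even started)
# when the timer is read, so this mostly measures launch overhead.
start = time.perf_counter()
loss = F.cross_entropy(logits, target)
naive_ms = (time.perf_counter() - start) * 1e3

# Correct timing: synchronize before and after so the interval covers
# the actual execution of the loss kernel on the GPU.
torch.cuda.synchronize()
start = time.perf_counter()
loss = F.cross_entropy(logits, target)
torch.cuda.synchronize()
synced_ms = (time.perf_counter() - start) * 1e3

print(f"naive: {naive_ms:.3f} ms, synchronized: {synced_ms:.3f} ms")
```

Calling loss.item() has the same effect as the explicit synchronize: it blocks until all queued GPU work has finished, which is why timings taken around it are so easy to misread.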