Strange slowness of mean operation

Attached is the profile summary of a PyTorch network training session (2 epochs):

As can be seen, a large part of the time (32%) is spent in the ‘mean’ method of ‘multilabel_soft_margin_loss’. The networks I am training have several large fully connected layers (1024 units each), so I would expect most of the time to be spent in the FC layers, not in the ‘mean’ method. All tensors reside on the GPU.
The reason I profiled my code is that it doesn’t achieve full utilization of the GPU (it fluctuates between 60% and 90% utilization).
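One thing worth checking here: CUDA kernels launch asynchronously, so a wall-clock profiler often bills the waiting time to the first operation that forces a host-device synchronization (such as reading a scalar loss) rather than to the kernels that actually did the work. The FC-layer time can therefore show up under ‘mean’. This is a minimal stdlib-only sketch of the same effect, using a background thread to stand in for an asynchronous GPU kernel (the thread, sleep duration, and variable names are illustrative, not anything from the original profile):

```python
import threading
import time

def launch_async(work, done):
    """Start 'work' on a background thread and return immediately,
    analogous to an asynchronous CUDA kernel launch."""
    t = threading.Thread(target=lambda: (work(), done.set()))
    t.start()
    return t

done = threading.Event()

# "FC layers": heavy work, launched asynchronously.
start = time.perf_counter()
t = launch_async(lambda: time.sleep(0.2), done)
launch_time = time.perf_counter() - start  # returns almost instantly

# "mean"/loss read: the first synchronization point gets billed the wait.
start = time.perf_counter()
done.wait()
sync_time = time.perf_counter() - start
t.join()

print(f"launch looked like {launch_time:.4f}s, sync absorbed {sync_time:.4f}s")
```

The launch appears nearly free while the synchronization point absorbs the 0.2 s, even though the "work" happened in the launched task. `torch.cuda.synchronize()` before and after a region (or PyTorch's autograd profiler with CUDA timing enabled) gives per-operation timings that avoid this misattribution.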

I would appreciate any advice on how to improve the performance of the code.


I am not an expert, but according to Make torch.nn.functional.multilabel_soft_margin_loss more stable #9141 and “use logsigmoid at multilabel_soft_margin_loss to make it more stable”, this loss used to be numerically unstable. I don’t know which version of PyTorch you are using, but maybe you should try the latest release.
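For context, the instability the PR addresses comes from computing log(sigmoid(x)) directly, which overflows or underflows for large |x|. A plain-Python sketch of the stable logsigmoid form (this is an illustration of the general technique, not the actual PyTorch implementation):

```python
import math

def logsigmoid(x):
    """Numerically stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def logsigmoid_naive(x):
    """Direct form; exp() overflows for large negative x."""
    return math.log(1.0 / (1.0 + math.exp(-x)))

print(logsigmoid(2.0), logsigmoid_naive(2.0))  # agree for moderate x
print(logsigmoid(-1000.0))                     # still finite
# logsigmoid_naive(-1000.0) raises OverflowError (math.exp(1000))
```

The stable form never exponentiates a positive argument, so it stays finite across the whole input range.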

Thank you @savchynt. I am using torch ‘1.0.0.dev20181014’, so I already have this change in the code.