Why is ConvolutionBackward being done on the CPU?

My model, loss function tensors, and data are all on the GPU, so I don't understand why autograd::engine::evaluate_function: ConvolutionBackward shows up as running on the CPU in the profiler output:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
autograd::engine::evaluate_function: ConvolutionBack...        -6.12%  -281265.000us        48.36%        2.225s       4.278ms       0.000us         0.00%        2.305s       4.433ms           520  
                             aten::convolution_backward         1.39%      64.085ms         3.07%     141.237ms     271.610us        1.983s        57.42%        2.179s       4.190ms           520  
                                   ConvolutionBackward0         0.03%       1.461ms         3.08%     141.835ms     272.760us       0.000us         0.00%        2.081s       4.002ms           520  
sm86_xmma_wgrad_implicit_gemm_indexed_tf32f32_tf32f3...         0.00%       0.000us         0.00%       0.000us       0.000us     552.345ms        15.99%     552.345ms       5.523ms           100  
void cudnn::cnn::wgrad2d_grouped_direct_kernel<false...         0.00%       0.000us         0.00%       0.000us       0.000us     523.568ms        15.16%     523.568ms       7.480ms            70  
autograd::engine::evaluate_function: CudnnBatchNormB...        -1.40%  -64428.000us        10.31%     474.326ms     912.165us       0.000us         0.00%     485.467ms     933.590us           520  
void tensorTransformGeneric<float, float, float, tru...         0.00%       0.000us         0.00%       0.000us       0.000us     395.194ms        11.44%     395.194ms       1.976ms           200  
                        aten::cudnn_batch_norm_backward         0.16%       7.342ms         0.43%      19.975ms      38.413us     316.039ms         9.15%     333.625ms     641.587us           520  
                                CudnnBatchNormBackward0         0.03%       1.171ms         0.45%      20.899ms      40.190us       0.000us         0.00%     320.630ms     616.596us           520  
void cudnn::batchnorm_bwtr_nhwc_semiPersist<float, f...         0.00%       0.000us         0.00%       0.000us       0.000us     298.971ms         8.66%     298.971ms     747.428us           400  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
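For context, this table came from torch.profiler; a minimal sketch of the kind of run that produces it (model, batch, and targets are placeholders, not my actual training code) is:

import torch
from torch.profiler import profile, ProfilerActivity

# Profile one training iteration on both the CPU and CUDA sides
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    predictions, probabilities = model(batch)  # placeholder model call
    loss = loss_function(predictions, probabilities, targets)
    loss.backward()

# Aggregated view, similar to the table above
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))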

This is my loss function (I'm using a regular Adam optimizer):

import torch
import torch.nn.functional as F


def loss_function(predictions, probabilities, target_prediction, prediction_loss_weight=torch.tensor(1.0)):
    # Allocate new tensors on the same device and dtype as the predictions
    torch_empty = _torch_empty_factory(predictions)
    torch_zeros = _torch_zeros_factory(predictions)

    modes = predictions[0].shape[0]
    batch_losses = torch_empty(predictions.shape[0], 1)

    for batch_i, batch in enumerate(predictions):
        # Distance of every predicted mode to the ground-truth trajectory
        distances = torch_empty(modes, 1)
        for mode_i, mode in enumerate(batch):
            distances[mode_i] = _trajectory_distance(
                predictions[batch_i][mode_i],
                target_prediction[batch_i]
            )

        # Pick the mode closest to the ground truth and regress only against it
        _, closest_trajectory_i = torch.min(distances, dim=0)
        closest_trajectory = predictions[batch_i][closest_trajectory_i].squeeze(0)

        l1_loss = F.smooth_l1_loss(closest_trajectory, target_prediction[batch_i])

        # One-hot classification target for the mode probabilities
        target_probability = torch_zeros(modes)
        target_probability[closest_trajectory_i] = 1

        confidence_loss = F.cross_entropy(probabilities[batch_i], target_probability)
        loss = (prediction_loss_weight * l1_loss) + confidence_loss

        batch_losses[batch_i] = loss

    return torch.mean(batch_losses)

def _trajectory_distance(pred, target):
    return torch.norm(pred - target)

def _torch_empty_factory(tensor):
    # Helper that allocates uninitialized tensors on the same device/dtype as `tensor`
    def F(*args, **kwargs):
        return tensor.new_empty(*args, **kwargs)
    return F

def _torch_zeros_factory(tensor):
    # Helper that allocates zero-filled tensors on the same device/dtype as `tensor`
    def F(*args, **kwargs):
        return tensor.new_zeros(*args, **kwargs)
    return F
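For illustration, this is roughly how the loss is called; the shapes (batch of 8, 3 modes, 12 timesteps with x/y coordinates) are placeholders, not the real model's dimensions:

device = torch.device("cuda")

# Hypothetical tensors, only to show the expected shapes and devices
predictions = torch.randn(8, 3, 12, 2, device=device, requires_grad=True)  # (batch, modes, timesteps, xy)
probabilities = torch.randn(8, 3, device=device, requires_grad=True)       # (batch, modes) logits
target_prediction = torch.randn(8, 12, 2, device=device)                   # (batch, timesteps, xy)

loss = loss_function(predictions, probabilities, target_prediction)
loss.backward()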

I would recommend using a visual profiler so you can check the actual timeline and see where the real bottleneck is. The host might just be accumulating time due to, e.g., a synchronization.
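For example, the same torch.profiler run can be exported as a Chrome trace and inspected on a timeline in chrome://tracing or Perfetto (a sketch; train_step is a placeholder for one forward/backward iteration):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()  # placeholder: one forward/backward pass

# Open the resulting file in chrome://tracing or https://ui.perfetto.dev
prof.export_chrome_trace("trace.json")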

Went ahead and did that; it looks like the time was spent in cudaMemcpyAsync.

Then I realized I had left torch.autograd.set_detect_anomaly enabled. I think setting it back to False solved my issue, and that's what was causing the long memcpy.
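In case it helps anyone else, this is the switch I mean (a sketch; the backward call is a placeholder):

import torch

# Anomaly detection records tracebacks and runs extra checks, which can slow the backward pass considerably
torch.autograd.set_detect_anomaly(False)  # global switch, keep it off outside of debugging

# Or enable it only around the region you are debugging
with torch.autograd.detect_anomaly():
    loss.backward()  # placeholder backward call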

I don't see a long memcpy; it seems the kernel was launched in advance and executed later (which is desired). However, I use Nsight Systems myself and am not too familiar with Chrome traces.