GPU utilisation is only at 7%

Hi!

I am trying to run an algorithm that generates a depth map using PyTorch in Python. It runs very slowly, and when I check with nvidia-smi the GPU utilisation caps at 7%. I checked and my algorithm is using CUDA (torch.cuda.is_available() returns True), I ran a few other checks, and everything seems to be OK. I am a beginner in all this and I have scoured the internet for a solution, but nothing seems to work. Do you know what the problem could be?



This can happen when there is a lot of CPU overhead in some other part of the code relative to the amount of GPU computation. Can you post a runnable code snippet (e.g., something that corresponds to what you think is the slowest part of the code) so we can reproduce and diagnose the issue?
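
To make the reasoning concrete: nvidia-smi reports the share of the sample period during which at least one kernel was running on the GPU. The sketch below uses made-up per-frame timings (they are not measurements from your code) and assumes the CPU and GPU work do not overlap, just to show how a small amount of GPU work next to a lot of CPU work gives a low figure.

```python
def gpu_utilisation(gpu_ms_per_frame, cpu_ms_per_frame):
    """Fraction of each frame's wall-clock time during which the GPU is busy.

    Assumes CPU work and GPU work run one after the other (no overlap).
    """
    return gpu_ms_per_frame / (gpu_ms_per_frame + cpu_ms_per_frame)


# Hypothetical numbers: 10 ms of GPU kernels per frame and 130 ms of CPU-side
# work (receiving the image, preprocessing, colour mapping, and so on).
print(f"{gpu_utilisation(10, 130):.0%}")  # prints 7%
```

If your numbers look anything like this, making the GPU part faster will barely change the total time per frame; the CPU side is what needs attention.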

Thank you for your reply. To explain my code a little more: I use a small drone (Tello) that captures images of a corridor and transmits them to my computer over Wi-Fi. I then process the received images with the following method, which seems to be very slow:

def test_simple(source, target, encoder_dict, intrinsics_json_path, pose_enc, pose_dec):
    """Function to predict for a single image or folder of images."""

    # Load input data
    input_image, original_size = load_and_preprocess_image(target,

    source_image, _ = load_and_preprocess_image(source,

    K, invK = load_and_preprocess_intrinsics(intrinsics_json_path,

    with torch.no_grad():

        # Estimate poses
        pose_inputs = [source_image, input_image]
        pose_inputs = [pose_enc(, 1))]
        axisangle, translation = pose_dec(pose_inputs)
        pose = transformation_from_parameters(axisangle[:, 0], translation[:, 0], invert=True)

        # Estimate depth
        output, lowest_cost, _ = encoder(current_image=input_image,

        output = depth_decoder(output)

        sigmoid_output = output[("disp", 0)]
        sigmoid_output_resized = torch.nn.functional.interpolate(
            sigmoid_output, original_size, mode="bilinear", align_corners=False)
        sigmoid_output_resized = sigmoid_output_resized.cpu().numpy()[:, 0]

        toplot = toplot.squeeze()
        normalizer = mpl.colors.Normalize(vmin=toplot.min(), vmax=np.percentile(toplot, 95))
        mapper = cm.ScalarMappable(norm=normalizer, cmap='magma')
        colormapped_im = (mapper.to_rgba(toplot)[:, :, :3] * 255).astype(np.uint8)
        #im = pil.fromarray(colormapped_im)
        return colormapped_im

Tell me if it’s not clear, or if you need something more runnable.

Your snippet calls several methods whose implementation we cannot see, so it is hard for us to run the code you provided. It would help to post a snippet that anyone can copy, paste into a file, and run directly. For this you could use torch.rand() tensors with the dimensions of your expected inputs.
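
As a rough illustration of what such a snippet could look like, here is a minimal sketch. The small convolutional model and the 1x3x192x640 input shape are placeholders I made up, not your actual network or image size; swap in your own encoder and decoder and the shape your preprocessing produces.

```python
import time

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the real depth network (replace with your model).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),
).to(device).eval()

# Random input with an assumed shape: (batch, channels, height, width).
dummy_image = torch.rand(1, 3, 192, 640, device=device)


def sync():
    # CUDA kernels run asynchronously, so wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()


with torch.no_grad():
    model(dummy_image)  # warm-up: the first call includes one-off initialisation
    sync()
    start = time.perf_counter()
    for _ in range(10):
        disp = model(dummy_image)
    sync()
    elapsed = time.perf_counter() - start

print(f"device={device}, output shape={tuple(disp.shape)}, "
      f"{elapsed / 10 * 1000:.2f} ms per forward pass")
```

Because the inputs are random tensors, nobody needs your drone, images, or intrinsics file to reproduce the timing.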

One idea that comes to mind is to print how long each method takes to run. That way you can see where your bottleneck is.
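
A minimal sketch of that idea, using only the standard library. The helper names timed and report are my own, and the two timed blocks in the demo are stand-ins for your real steps. When a step runs on the GPU, pass sync=torch.cuda.synchronize, because CUDA calls return before the kernels finish and the time would otherwise be attributed to a later step (typically the .cpu() call).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Add the wall-clock time of the enclosed block to results[label].

    sync is an optional zero-argument callable invoked before each clock
    read, e.g. torch.cuda.synchronize when the block launches GPU work.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = results.get(label, 0.0) + (time.perf_counter() - start)


def report(results):
    """Print the recorded steps, slowest first, with their share of the total."""
    total = sum(results.values())
    for label, seconds in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * seconds / total if total else 0.0
        print(f"{label:<20s} {seconds * 1000:9.2f} ms  {share:5.1f}%")


if __name__ == "__main__":
    timings = {}
    with timed("load", timings):
        data = [i * 0.5 for i in range(100_000)]  # stand-in for image loading
    with timed("compute", timings):
        total = sum(x * x for x in data)  # stand-in for the forward pass
    report(timings)
```

In your function you would wrap each stage (loading and preprocessing, pose estimation, depth estimation, the .cpu().numpy() transfer, colour mapping) in its own timed block and call report once per frame or after a batch of frames.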