CPU at 100% while GPU remains at 1% on transforms

Here is the code I isolated to reproduce the problem.
During inference the CPU sits at 100% load while the GPU stays almost idle.
This is inference only, so the input image is light, about 400x400 px, and the model itself is very lightweight. The heavy load comes from this line: image_transformed = transform(image).unsqueeze(0).to(device)
I can't batch the inputs, so a new image is inspected every time… How can I reduce the CPU load? The classifier's inference takes only 5 ms, but with the CPU at 100% my application runs out of cycles for its other main tasks.

import time

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from PIL import Image

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device:", device)


def main():

    # Load and preprocess an image for inference
    image = Image.open(image_path)

    ###
    # Define Image transformations
    transform = transforms.Compose([
        transforms.Resize((224, 224)),  # 224 for the ResNet-50 input size
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    trainset = torchvision.datasets.ImageFolder(root=PathTrain, transform=transform)
    classes = trainset.classes

    model = torchvision.models.squeezenet1_1(weights=True)
    model.classifier[1] = nn.Conv2d(512, len(classes), kernel_size=(1, 1))
    model.num_classes = len(classes)

    model.load_state_dict(torch.load(PathModel))
    model.eval()

    model.to(device)
    ###

    while True:
        try:
            start_time = time.time()

            # Process image
            image_transformed = transform(image).unsqueeze(0).to(device)

            with torch.no_grad():
                output = model(image_transformed)

            _, indice = torch.max(output, 1)

            elapsed_time = (time.time() - start_time) * 1000

        except Exception as e:
            response = 'ERROR'

        finally:
            time.sleep(0.1)


if __name__ == "__main__":
    main()
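
For reference, here is a minimal timing sketch (reusing transform, image, model and device from the snippet above) to check whether the CPU time goes into the transform itself or into the copy/inference step; the synchronize call is only there so the GPU part is timed correctly:

import time
import torch

t0 = time.perf_counter()
tensor = transform(image).unsqueeze(0)         # CPU-side preprocessing (Resize + ToTensor + Normalize)
t1 = time.perf_counter()

tensor = tensor.to(device, non_blocking=True)  # host-to-device copy
with torch.no_grad():
    out = model(tensor)
if device.type == "cuda":
    torch.cuda.synchronize()                   # wait for the GPU so the second timing is meaningful
t2 = time.perf_counter()

print(f"transform: {(t1 - t0) * 1000:.1f} ms, copy + inference: {(t2 - t1) * 1000:.1f} ms")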

You could try to use CUDA Graphs to reduce the CPU workload, as it seems the CPU might be the bottleneck in your application.
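
As a rough illustration (not code from the docs), a manual capture for this fixed-shape inference could look like the sketch below, assuming model and device come from the snippet above, device is a CUDA device, and the input shape stays (1, 3, 224, 224):

import torch

# Static input buffer reused for capture and for every replay
static_input = torch.empty(1, 3, 224, 224, device=device)

# Warm up on a side stream before capture, as recommended for CUDA Graphs
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph; static_output is reused on every replay
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    with torch.no_grad():
        static_output = model(static_input)

# Per frame: copy the preprocessed image into the static buffer and replay the graph,
# which replaces the per-operator launch work on the CPU with a single replay call
static_input.copy_(transform(image).unsqueeze(0).to(device))
g.replay()
_, indice = torch.max(static_output, 1)

Note that the transform itself still runs on the CPU; the captured graph only removes the per-kernel launch overhead of the forward pass.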

Thanks for the tip. I'm reading up on it now, since it's new to me. If you have an example or a link, I would appreciate it.

You can check the docs here, which explain how to use it manually (or via a utility function), but the easier approach might be to use torch.compile with mode="reduce-overhead", which should apply it for you (assuming your model is compatible with CUDA Graphs).
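
For example, something along these lines (assuming the model and image_transformed from the snippet above):

import torch

compiled_model = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    output = compiled_model(image_transformed)  # first call compiles; subsequent calls reuse the cached graph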

Isn't torch.compile unsupported on Windows? I'm getting: RuntimeError: Windows not yet supported for torch.compile

Yes, I don’t think torch.compile is supported on Windows due to the lack of OpenAI/Triton support on Windows.

Mm, I am indeed using Windows. I would like to understand why grabbing from 6 cameras at 60 fps takes only 10% of the CPU, while the CPU-to-GPU path at just 30 images/s takes 100% of the CPU, even with multithreading or other parallelization. So, is there no other method to accelerate this or reduce the CPU load? Should these values of 100% CPU and 1% GPU be accepted as 'normal' for inference in production?