How to Accelerate Inference for a Deep Slim Network?

Background:

I have a deep network with 20 layers, where both the input and intermediate feature maps have very few channels. Here’s a simplified example:

import torch
from torch import nn

# 20 stacked 3x3 convolutions with only 2 input/output channels each
model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda()
x = torch.rand(16, 2, 64, 64).cuda()

# busy loop, used only to observe GPU utilization
while True:
    model(x)

When running this on an NVIDIA RTX 4090 GPU, utilization stays at around 15% or lower, which suggests the individual kernels are too small to keep the GPU busy and inference speed is therefore suboptimal.

Question:
How can I accelerate inference for such a deep and slim network to better utilize the GPU? The input shape and intermediate feature shapes are fixed in my case. Any suggestions or insights would be greatly appreciated.

Since your input shapes are static, you could apply CUDA Graphs to your workflow, either manually as explained here or via torch.compile.
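
For example, with the toy model above, both approaches look roughly like this (a minimal sketch: "reduce-overhead" is the torch.compile mode that inserts CUDA Graphs, and the manual capture follows the usual warm-up/capture/replay pattern):

import torch
from torch import nn

model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda().eval()
static_input = torch.rand(16, 2, 64, 64, device="cuda")

# Option 1: let torch.compile handle it ("reduce-overhead" enables CUDA Graphs)
compiled = torch.compile(model, mode="reduce-overhead")

# Option 2: capture the forward pass manually
# warm up on a side stream before capture so cuDNN/allocator state is initialized
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_output = model(static_input)

# replay: copy new data into the captured input tensor, then launch the whole graph at once
static_input.copy_(torch.rand(16, 2, 64, 64, device="cuda"))
g.replay()  # runs all 20 convolutions with a single graph launch
result = static_output.clone()

Either way, the 20 tiny convolution kernels are launched as one graph instead of 20 separate launches, which is usually where the time goes for such a slim model.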

I sincerely appreciate your suggestions. After reviewing the torch.compile documentation, I integrated the following lines into my training script:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# allow TF32 matmuls for faster float32 math on Ampere and newer GPUs
torch.set_float32_matmul_precision('high')
...
# enable cuDNN and let it benchmark conv algorithms for the fixed input shapes
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
...
# compile the model with the most aggressive autotuning mode
model = torch.compile(model, mode="max-autotune", fullgraph=True)
model = DDP(model, device_ids=[rank])  # rank comes from the distributed setup elsewhere
...

These changes significantly boosted the training speed. The GPU utilization increased from 70% to 95%. Please note that my actual training script and network are much more complex than the toy example I previously posted.
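
In case it helps others, this is roughly how the eager and compiled forward passes can be compared on the toy model from my first post (just a sketch using torch.utils.benchmark, not my real training script):

import torch
import torch.utils.benchmark as benchmark
from torch import nn

model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda()
x = torch.rand(16, 2, 64, 64, device="cuda")

compiled = torch.compile(model, mode="max-autotune", fullgraph=True)
compiled(x)  # run once so compilation/autotuning happens outside the timed region

# Timer synchronizes CUDA internally, so the reported times are per forward pass
eager_t = benchmark.Timer(stmt="model(x)", globals={"model": model, "x": x})
compiled_t = benchmark.Timer(stmt="compiled(x)", globals={"compiled": compiled, "x": x})
print(eager_t.timeit(100))
print(compiled_t.timeit(100))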

Once again, thank you for your kind and valuable assistance. :smile: