I have a deep network with 20 layers, where both the input and intermediate feature maps have very few channels. Here’s a simplified example:
import torch
from torch import nn
model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda()
x = torch.rand(16, 2, 64, 64).cuda()
# Forward-only loop, just to observe steady-state GPU utilization.
while True:
    model(x)
When I run this on an NVIDIA RTX 4090, GPU utilization sits at around 15% or lower, so the GPU spends most of its time idle and inference is much slower than the hardware should allow.
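In case it helps, one way to check whether the bottleneck is kernel-launch overhead rather than actual GPU compute would be a short torch.profiler trace of the toy model. This is only a rough sketch of how I would measure it, not output from my real script:

import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda()
x = torch.rand(16, 2, 64, 64).cuda()

# Warm up first so allocator and cuDNN setup do not distort the trace.
for _ in range(10):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

# If each CUDA kernel is only a few microseconds long while CPU-side launch
# time dominates, the model is launch-overhead bound rather than compute bound.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))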
Question:
How can I accelerate inference for such a deep and slim network to better utilize the GPU? The input shape and intermediate feature shapes are fixed in my case. Any suggestions or insights would be greatly appreciated.
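For reference, one mitigation I have seen recommended for exactly this situation (many tiny kernels, fixed shapes) is CUDA graph capture, so the whole 20-layer forward is launched with a single call instead of one launch per op. Below is a minimal, untested sketch following the pattern in the PyTorch CUDA-graphs documentation; the static_x / static_y buffer names are just placeholders I made up:

import torch
from torch import nn

model = nn.Sequential(*[nn.Conv2d(2, 2, 3, padding=1) for _ in range(20)]).cuda().eval()

# CUDA graphs need fixed shapes and fixed memory addresses, so use static buffers.
static_x = torch.rand(16, 2, 64, 64, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_y = model(static_x)

# At inference time: copy new data into the static input buffer and replay.
new_batch = torch.rand(16, 2, 64, 64, device="cuda")
static_x.copy_(new_batch)
g.replay()                     # replays all 20 conv kernels with one launch
result = static_y.clone()      # static_y now holds the output for new_batch

On recent PyTorch versions, torch.compile(model, mode="reduce-overhead") is supposed to apply CUDA graphs automatically, which might be a simpler first thing to try.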
Update: these changes significantly boosted training speed, and GPU utilization went from 70% to 95%. Note that my actual training script and network are much more complex than the toy example I posted above.
Once again, thank you for your kind and valuable assistance.