Model inference takes 1 s every ~8 passes, otherwise <1 ms

Hello!! Please help. Thank you.

Inference takes <1 ms, but every 8 passes or so it takes 1000 ms! I'm using PyTorch 2.0.1 and have tried many things so far, as you can see in the code. I'm on a Windows machine, and this happens on two different computers.

Here is my code:

import gc
import os
from time import time

import torch
import segmentation_models_pytorch as seg
import utils  # local helper module providing get_state_dict


class SegModel():
    def __init__(self, parameters):
        gc.collect()
        torch.cuda.empty_cache()
        os.environ['CUDA_VISIBLE_DEVICES'] = '0'
        ENCODER = 'vgg19_bn'
        ENCODER_WEIGHTS = 'imagenet'
        CLASSES = ['my class']
        ACTIVATION = 'sigmoid'  # could be None for logits or 'softmax2d' for multiclass segmentation
        DEVICE = 'cuda'
        in_channels = 3

        # create segmentation model with pretrained encoder
        model = seg.UnetPlusPlus(
            encoder_name=ENCODER,
            # encoder_weights=ENCODER_WEIGHTS,
            classes=len(CLASSES),
            activation=ACTIVATION,
            in_channels=in_channels,
        )
        torch.backends.cudnn.benchmark = True
        torch.backends.cudnn.enabled = False
        torch.backends.cudnn.deterministic = True
        torch.set_flush_denormal(True)
        # note: a bare torch.no_grad() call is a no-op; set_grad_enabled(False)
        # is what actually disables autograd here
        torch.set_grad_enabled(False)
        torch.jit.enable_onednn_fusion(True)
        ENCODER_WEIGHTS = utils.get_state_dict(parameters.path_to_weights)
        model.load_state_dict(ENCODER_WEIGHTS)
        model = model.half()
        model.eval()
        self.model = model.to(DEVICE)
        self.batch_size = 1
        torch.cuda.synchronize()
        t0 = int(time())
        self.model = self.warmup_operation(40)
        torch.cuda.synchronize()
        t1 = int(time())
        print('total_warmup_time', (t1 - t0) * 1000)

    def warmup_operation(self, n_warmup_iterations):
        # gc.collect()
        # torch.cuda.empty_cache()
        dummy = torch.ones((self.batch_size, 3, 256, 256), dtype=torch.float16, device='cuda') / 0.5
        traced_model = torch.jit.trace(self.model, dummy)
        traced_model = torch.jit.freeze(traced_model)
        for _ in range(n_warmup_iterations):
            # torch.cuda.empty_cache()
            torch.cuda.synchronize()
            t0 = int(time())
            traced_model(dummy)
            torch.cuda.synchronize()
            t1 = int(time())
            print('warmup_time', (t1 - t0) * 1000)
        # self.model(dummy)
        return traced_model

When I instantiate the model above, it runs the warmup_operation method and prints the times (in ms) below:

warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 1000
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 1000
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 1000
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 1000
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 0
warmup_time 1000
warmup_time 0
warmup_time 0
total_warmup_time 9000

Process finished with exit code 0

The same spikes happen when running inference on real images.
This is a huge mystery to me, and I need to solve it since I have a real-time application.

Thank you!!

Lisa Koenigsberg

The spikes are an artifact of the rounding: time() returns seconds as a float, and casting it to int truncates to whole seconds. Each sub-millisecond pass then reports 0, except when a pass happens to straddle a second boundary, in which case t1 - t0 becomes 1 and is printed as 1000 ms. Remove the rounding by dropping the int() call around the timestamps, since the actual execution time is in the millisecond range:

import time

for _ in range(10):
    t0_float = time.time()
    t0 = int(t0_float)

    time.sleep(1e-1)

    t1_float = time.time()
    t1 = int(t1_float)

    print(t1_float - t0_float)
    print(t1 - t0)

Output:

0.10011005401611328
0
0.10011029243469238
0
0.10013604164123535
1
0.10013532638549805
0
0.10013484954833984
0
0.10013461112976074
0
0.10011577606201172
0
0.1001133918762207
0
0.10011577606201172
0
0.1001138687133789
0
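
As a side note, host-side timestamps only measure the GPU work correctly because you synchronize around the call; CUDA events measure the device time directly instead. A minimal sketch, using the traced_model and dummy from your code:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
traced_model(dummy)
end.record()

torch.cuda.synchronize()  # wait for both events before reading the timer
print('elapsed_ms', start.elapsed_time(end))  # float milliseconds, no int() rounding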

Yes, that certainly is it! Thank you very much.
Now that I've fixed that: on my computer, with a batch_size of 8, it takes 125 ms per pass, which is too long for my application. Besides using a lighter model, do you have any tips on what to try?

Thank you!

You could profile your code to narrow down the bottleneck in the workflow and then decide what to optimize based on that. For example, if your use case is CPU-limited, you might want to apply CUDA Graphs as described here. You could also try the latest torch.compile utility (with CUDA Graphs if needed) and check whether kernel fusion and other optimizations help.
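
A minimal sketch of both ideas, assuming a batch of 8 half-precision 256x256 inputs and using a stand-in Conv2d where your real model would go (note that torch.compile support on Windows was still limited in 2.0.1):

import torch
from torch.profiler import profile, ProfilerActivity

# stand-in for your real model; substitute your eager (untraced) model here
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).half().cuda().eval()
x = torch.ones((8, 3, 256, 256), dtype=torch.float16, device='cuda')  # assumed input shape

# 1) profile a few passes to see whether CPU launch overhead or GPU kernels dominate
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# 2) if CPU overhead dominates, try torch.compile;
#    mode="reduce-overhead" applies CUDA Graphs under the hood
compiled_model = torch.compile(model, mode="reduce-overhead")
with torch.no_grad():
    for _ in range(5):  # warmup passes trigger compilation / graph capture
        compiled_model(x)
    out = compiled_model(x)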