Is it possible to enable apex (opt_level='O1') only during inference on a model trained purely in FP32 (without apex)?

I want to run a few inference passes on a pre-trained model that was trained purely in FP32 (without apex or amp). My main aim is to get faster results during inference, not during training. Ideally, I want to initialize mixed precision after the model is trained and before it is fed unseen data. How can this be done with the NVIDIA apex library? It would be great if some code snippets could be attached too.

Secondly, given a model trained and run purely in FP32, say modelA, and a model trained in FP32 (without apex or amp) but run with apex (opt_level='O1') for inference, say modelB, how would the inference execution times of the code below differ between the two models?

modelA = modelA.to(device) # to be inferred without apex
modelB = modelB.to(device) # to be inferred with apex
tensor = torch.rand(1,C,H,W).to(device) # random tensor for testing

with torch.no_grad():
    modelA.eval()
    modelA(tensor) # inferencing
    # Calculate cuda_time for the execution

with torch.no_grad():
    modelB.eval()
    # Initialize apex (opt_level='O1') code snippets for faster inferencing
    modelB(tensor) # inferencing
    # Calculate cuda_time for the execution

apex.amp is deprecated and you should use the native mixed-precision utility via torch.cuda.amp as described here.

With that being said, yes, it's possible to activate autocast only for inference. The docs give you some examples, and in particular you can skip the training utilities (e.g. the GradScaler).
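For illustration, a minimal inference-only sketch using the native utility (the resnet50 model and the input shape are placeholders, assuming a CUDA-capable GPU):

import torch
import torchvision

device = torch.device('cuda')

# Placeholder model: any FP32-trained model can be used here.
model = torchvision.models.resnet50().to(device)
model.eval()

tensor = torch.rand(1, 3, 224, 224, device=device)

with torch.no_grad():
    # autocast runs eligible ops (convolutions, matmuls, ...) in FP16 and keeps
    # numerically sensitive ops in FP32; no GradScaler is needed for inference.
    with torch.cuda.amp.autocast():
        output = model(tensor)

print(output.dtype)  # torch.float16 for the final linear layer under autocast

The model parameters stay in FP32; autocast only changes the dtype used by the ops inside the forward pass.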

Thank you for this reply. I will go through the links you mentioned, but I wanted to ask one question.

We are more interested in fast inference than in fast training. Does mixed precision using torch.cuda.amp also allow faster inference, or does it only speed up training?

As said before, it's possible to activate autocast during inference; the actual speedup depends on the model architecture, cuDNN version, GPU, etc.


Thank you for your reply, it really works!
I did several experiments with ResNet-50 on input tensors of different spatial resolutions, like below:

with torch.no_grad():
    model.eval()

    # CUDA events measure GPU time; synchronize before reading the elapsed time.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    model(tensor)  # warmup (cuDNN algorithm selection, lazy initialization, etc.)

    start.record()
    with torch.cuda.amp.autocast(enabled=True):
        for i in range(10):
            model(tensor)
    end.record()
    torch.cuda.synchronize()
    print('execution time in MILLISECONDS: {}'.format(start.elapsed_time(end) / 10))

With autocast, in each case I got at least a 2x speedup during inference, which is really appreciable. But could you please explain where this speedup comes from? What magic is PyTorch doing under the hood?

From the NVIDIA post:

Benefits of mixed precision training

  • Speeds up math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
  • Speeds up memory-limited operations by accessing half the bytes compared to single-precision.
  • Reduces memory requirements for training models, enabling larger models or larger minibatches.
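The first two points also apply to a pure forward pass, which is why inference benefits as well. As a rough illustration of the reduced memory traffic, a sketch along these lines can be used (the resnet50 model, batch size, and input shape are arbitrary choices; the exact numbers depend on the GPU and cuDNN version):

import torch
import torchvision

device = torch.device('cuda')
model = torchvision.models.resnet50().to(device).eval()
tensor = torch.rand(16, 3, 224, 224, device=device)

def peak_forward_memory_mb(use_amp):
    # Reset the allocator statistics, run one forward pass, and report the peak
    # allocated memory; under autocast the intermediate activations are FP16.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        with torch.cuda.amp.autocast(enabled=use_amp):
            model(tensor)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**2

print('FP32 peak memory (MB): {:.1f}'.format(peak_forward_memory_mb(False)))
print('AMP  peak memory (MB): {:.1f}'.format(peak_forward_memory_mb(True)))

The parameters themselves stay in FP32, so the savings come from the activations and per-op casts; the compute speedup on top of that comes from Tensor Cores executing the FP16 convolutions and matmuls.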