Is it possible to enable apex (opt_level=O1) only during inference on a model trained purely in FP32 (without apex)?

I want to run inference on a pre-trained model trained purely in FP32 (without apex or amp). My main aim is to get faster results during inference, not during training. Ideally, I want to enable mixed precision after the model is trained and before the model is fed unseen data. How can this be done using the NVIDIA apex library? It would be great if some code snippets could be attached too.

Secondly, consider a model trained and inferred purely in FP32, say modelA, and a model trained in FP32 (without apex or amp) but inferred with apex (opt_level='O1'), say modelB. How would the inference execution time of the code below differ between the two models?

import torch

device = torch.device('cuda')

modelA = ...  # to be inferred without apex
modelB = ...  # to be inferred with apex
tensor = torch.rand(1, C, H, W).to(device)  # random tensor for testing

with torch.no_grad():
    modelA(tensor)  # inference
    # Calculate cuda_time for the execution

with torch.no_grad():
    # Initialize apex (opt_level='O1') code snippets for faster inference
    modelB(tensor)  # inference
    # Calculate cuda_time for the execution

apex.amp is deprecated and you should use the native mixed-precision utility via torch.cuda.amp as described here.

With that being said, yes, it's possible to activate autocast only for inference. The docs give some examples, and in particular you can skip the training utilities (e.g. the GradScaler).
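A minimal sketch of inference-only autocast (MyModel and the input shape are placeholders, not from the original post; a CUDA device is assumed):

import torch

device = torch.device('cuda')
model = MyModel().to(device).eval()  # hypothetical FP32-trained model, left unchanged
tensor = torch.rand(1, 3, 224, 224, device=device)

with torch.no_grad():
    with torch.cuda.amp.autocast():
        output = model(tensor)  # eligible ops run in float16 automatically
# No GradScaler is needed, since no gradients are computed during inference.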

Thank you for this reply. I will go through the links you mentioned, but I wanted to ask one question.

We are interested in fast inference more than fast training. Does mixed precision using torch.cuda.amp allow faster inference as well, or does it speed up only training?

As said before, it’s possible to activate autocast during inference and the actual speedup depends on the model architecture, cudnn version, GPU, etc.
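To measure that difference on your own setup, one option is to time the GPU work with CUDA events. This is a sketch under the assumptions above; time_inference is a hypothetical helper, not part of any library:

import torch

def time_inference(model, tensor, use_autocast=False, iters=100):
    # Hypothetical helper: average GPU time per forward pass in ms
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=use_autocast):
        for _ in range(10):  # warm-up so cudnn can select kernels first
            model(tensor)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=use_autocast):
        for _ in range(iters):
            model(tensor)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# modelA: time_inference(modelA, tensor)                      -> plain FP32
# modelB: time_inference(modelB, tensor, use_autocast=True)   -> with autocast

Comparing the two returned averages on your actual model and GPU is more reliable than any general claim about the speedup.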
