Will pytorch native mixed precision training be slower than apex when there are multiple losses?

Hi,

A simplified version of my training pipeline is like this:

import torch.cuda.amp as amp

scaler = amp.GradScaler()

optim.zero_grad()
with amp.autocast(enabled=True):
    # forward pass and losses are computed under autocast
    logits1, logits2, logits3 = model(imgs)
    loss1 = criteria(logits1, labels)
    loss_aux = [criteria(logits2, labels), criteria(logits3, labels)]
    loss = loss1 + sum(loss_aux)
# scale the summed loss once, then step and update the scaler
scaler.scale(loss).backward()
scaler.step(optim)
scaler.update()

While the apex version is like this:

from apex import amp

optim.zero_grad()
logits1, logits2, logits3 = model(imgs)
loss1 = criteria(logits1, labels)
loss_aux = [criteria(logits2, labels), criteria(logits3, labels)]   
loss = loss1 + sum(loss_aux)

with amp.scale_loss(loss, optim) as scaled_loss:
    scaled_loss.backward()
optim.step()

I am using pytorch 1.6.0 with python 3.6.9 on ubuntu 16.04 (docker container) with cuda 10.1.243/cudnn 7.

There are two things that puzzle me:
The first one is that I received a warning like this:

/miniconda/envs/py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

though I am sure I called the lr_scheduler after scaler.step(optim). Also, when I set autocast(enabled=False), the warning does not appear.
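
To be concrete, the order in my loop is like this (the scheduler name here is illustrative, not my exact variable name):

scaler.scale(loss).backward()
scaler.step(optim)
scaler.update()
lr_scheduler.step()  # called only after scaler.step(optim)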

The second is that the pytorch native version is much slower than the apex version. When I use pytorch's autocast, the training time for 100 iterations is about 110s, while with apex it is around 80s.

Is there any problem with my usage of this new feature, and how could I solve it?

The learning rate scheduler warning might be raised if the gradients contained invalid values and the optimizer.step() call was therefore skipped.
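
If you want to verify whether that is happening, one rough way (a sketch, not an official API for this) is to compare the scale before and after scaler.update(); if the scale was reduced, invalid gradients were found and the optimizer step was skipped, so you could skip the scheduler call as well (scheduler name is illustrative):

scale_before = scaler.get_scale()
scaler.scale(loss).backward()
scaler.step(optim)   # silently skipped internally if inf/NaN gradients are detected
scaler.update()
if scaler.get_scale() >= scale_before:
    # the optimizer actually stepped, so stepping the scheduler will not warn
    lr_scheduler.step()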

Which opt_level in apex are you using and did you use the same CUDA and cudnn versions for the comparison?

Thanks for replying! I am using the apex O1 opt level, and yes, I tried the new pytorch 1.6 feature on the same platform as the original apex codebase. The pytorch native fp16 training is still slower than the apex-based fp16 training.
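
For completeness, the apex setup in my script is the usual initialization at that level, roughly like this (not my exact code):

from apex import amp

# model and optim are constructed as usual before this call
model, optim = amp.initialize(model, optim, opt_level="O1")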

Could you update to the CUDA 10.2 binaries (which ship with cudnn 7.6.5.32) and post the general model architecture? You don’t need to post the exact model if that isn’t possible; a “similar” architecture would be sufficient.

Hi,

I am glad you are willing to look into my code. I created a docker image, and you can reproduce my problem by pulling and running it:

$ docker pull coincheung/debug_zzy:first
$ nvidia-docker run -it --ipc=host coincheung/debug_zzy:first bash

To reproduce the apex training, the command is:

# cd /debug/bisenetv2
# CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 train.py

And to switch to the pytorch native fp16 training, the command is:

# cd /debug/bisenetv2
# CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 train_amp.py

On my platform, the training time of apex is around 85s/100iter, and the pytorch native fp16 training time is around 115s/100iter.