A simplified version of my training pipeline looks like this:
import torch.cuda.amp as amp

scaler = amp.GradScaler()

optim.zero_grad()
with amp.autocast(enabled=True):
    logits1, logits2, logits3 = model(imgs)
    loss1 = criteria(logits1, labels)
    loss_aux = [criteria(logits2, labels), criteria(logits3, labels)]
    loss = loss1 + sum(loss_aux)
scaler.scale(loss).backward()   # backward on the scaled loss, outside the autocast block
scaler.step(optim)
scaler.update()
While the apex version is like this:
from apex import amp

optim.zero_grad()
logits1, logits2, logits3 = model(imgs)
loss1 = criteria(logits1, labels)
loss_aux = [criteria(logits2, labels), criteria(logits3, labels)]
loss = loss1 + sum(loss_aux)
with amp.scale_loss(loss, optim) as scaled_loss:
    scaled_loss.backward()
optim.step()
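For completeness, the apex path also assumes the model and optimizer were wrapped once before the training loop; a minimal sketch of that setup (the opt_level shown is just an example, not necessarily the one I use):

# apex requires wrapping the model and optimizer once, before the training loop;
# the opt_level value here is only an example.
from apex import amp

model, optim = amp.initialize(model, optim, opt_level="O1")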
I am using PyTorch 1.6.0 with Python 3.6.9 on Ubuntu 16.04 (in a Docker container) with CUDA 10.1.243/cuDNN 7.
There are two things that puzzle me:
The first one is that I received a warning like this:
/miniconda/envs/py36/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
though I am sure I call lr_scheduler.step() after scaler.step(optim), and when I set autocast(enabled=False) the warning does not appear.
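To make the ordering concrete, the relevant part of my loop looks roughly like this sketch (names are the same as in the first snippet; lr_scheduler stands for whichever scheduler is used):

# model, criteria, optim, scaler, imgs, labels as in the first snippet above
optim.zero_grad()
with amp.autocast(enabled=True):
    logits1, logits2, logits3 = model(imgs)
    loss = sum(criteria(lg, labels) for lg in (logits1, logits2, logits3))
scaler.scale(loss).backward()
scaler.step(optim)     # may internally skip optim.step() if inf/nan gradients are found
scaler.update()
lr_scheduler.step()    # called only after scaler.step(optim)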
The second is that the PyTorch native version is much slower than the apex version. When I use the autocast of PyTorch, the training time for 100 iterations is 110 s; when I use apex, it is around 80 s.
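A minimal sketch of how such a 100-iteration timing can be taken (run_one_iteration is a hypothetical placeholder for the loop body shown above):

import time
import torch

torch.cuda.synchronize()        # finish any queued GPU work before starting the clock
start = time.time()
for _ in range(100):
    run_one_iteration()         # hypothetical placeholder for the loop body shown above
torch.cuda.synchronize()        # wait for the GPU before reading the clock
print('100 iters: {:.1f}s'.format(time.time() - start))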
Is there any problem with my usage of this new feature, and how could I solve it?