FP16 (AMP) training slowdown with PyTorch 1.6.0


I’m experiencing strangely slow training speed with PyTorch 1.6.0 + AMP.

I built 2 docker images, and the only difference between them is that one has torch 1.5.0+cu101 and the other has torch 1.6.0+cu101. On these two docker images, I ran the same code (Huggingface xlmr-base model for token classification) on the same hardware (P40 GPU), with no distributed data parallel or gradient accumulation. The table below summarizes the training speed I got:

samples/s                | PyTorch 1.5.0 | PyTorch 1.6.0 | diff
FP32                     | 51.97         | 51.57         | -0.4
FP16 with apex.amp O1    | 51.25         | 47.43         | -3.82
FP16 with apex.amp O2    | 56.68         | 49.09         | -7.59
FP16 with torch.cuda.amp | N/A           | 47.17         | N/A

Comparing 1.5.0 and 1.6.0, the FP32 speeds are close, but the speeds with APEX AMP O1 and O2 are both significantly slower with 1.6.0. PyTorch 1.6.0 native AMP is also much slower than 1.5.0 + apex.amp. All three FP16 AMP configurations with 1.6.0 are slower than FP32.

Again, the only difference is the PyTorch version in the docker images. Everything else is common to both images: CUDA 10.1, cuDNN, Python 3.6.8.

Do you have any suggestions on this problem?
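For reference, my torch.cuda.amp training loop follows the standard autocast + GradScaler pattern. A minimal sketch with a toy model and random data (not my actual code, which uses xlmr-base):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model/data (xlmr-base in the actual runs).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# GradScaler/autocast are no-ops when enabled=False, so this also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(3):
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    # autocast runs the forward pass in mixed precision on CUDA.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = loss_fn(model(inputs), targets)
    # Scale the loss, backprop, then unscale + step via the scaler.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```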


Upon further investigation, I tried building a docker image by extending pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel, and ran the same experiments on it. With that image, I now get comparable results on 1.6.0:

samples/s                | image A, PyTorch 1.5.0 | image A, PyTorch 1.6.0 | image B, PyTorch 1.6.0
FP32                     | 51.97                  | 51.57                  | 51.44
FP16 with apex.amp O1    | 51.25                  | 47.43                  | 50.57
FP16 with apex.amp O2    | 56.68                  | 49.09                  | 55.98
FP16 with torch.cuda.amp | N/A                    | 47.17                  | 48.26

In docker image A, I have:

FROM nvidia/cuda:10.1-devel-ubuntu18.04

# PYTORCH_VERSION is 1.6.0+cu101 or 1.5.0+cu101;
# TORCHVISION_VERSION is the matching 0.7.0+cu101 or 0.6.0+cu101
ENV PYTORCH_VERSION=1.6.0+cu101
ENV TORCHVISION_VERSION=0.7.0+cu101

# CUDNN_VERSION is defined elsewhere in the Dockerfile (not shown here)
RUN apt-get update && \
    apt-get install -y --allow-change-held-packages --allow-downgrades --no-install-recommends \
    libcudnn7=${CUDNN_VERSION} \
    libcudnn7-dev=${CUDNN_VERSION}

RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} -f https://download.pytorch.org/whl/torch_stable.html

In docker image B, I have:

FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel

So my new question is: what could be wrong with docker image A that causes the FP16 speed regression from 1.5.0 to 1.6.0? Am I installing PyTorch correctly in it?
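In case it helps to narrow this down, here is a quick diagnostic one could run inside both containers to confirm what each actually ships (this is a generic sketch, not output from my images):

```python
import torch

# Compare the library versions the two containers actually ship.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```

If the cuDNN versions differ between image A and image B, that alone could explain a kernel-level speed difference.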

Thank you


Just to be sure that you did a fair benchmark: did you set all seeds, enable cuDNN deterministic mode, and disable its benchmark mode? Different seeds can mean differently shuffled batches, which might be more efficient in some cases than in others (depending on how the padding is done). Something like this:

import os
import random
import numpy as np
import torch

def set_seed(seed: int):
    """Set all seeds to make results reproducible (deterministic mode).
    When seed is None, deterministic mode is not enabled.
    """
    if seed is not None:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        os.environ['PYTHONHASHSEED'] = str(seed)
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

Thank you for the quick reply. Yes, the seeds are all set. The loss curves overlap, so the batches are deterministic.
I also repeated the experiments multiple times with different model sizes and data, and all show the same phenomenon. The gap is just too big to be the result of randomness.

New updates added to the original post.

Did you synchronize the code properly before starting and stopping the timer?
If not, this would yield invalid profiling results, since CUDA operations are asynchronously executed and your Python script might “run ahead”.
Also note that you shouldn’t expect speedups on the P40, as no TensorCores are available on this device.
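A minimal sketch of synchronized timing (step_fn and batch_size are placeholder names, not from your code):

```python
import time
import torch

def timed_throughput(step_fn, num_steps: int, batch_size: int) -> float:
    """Measure samples/s, synchronizing CUDA before reading the clock."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending kernels before starting
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last step to actually finish
    elapsed = time.perf_counter() - start
    return num_steps * batch_size / elapsed
```

Without the second synchronize, the timer can stop while kernels are still queued, under-reporting the elapsed time.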

Yes, I understand. The throughput is calculated over many updates, so the slowdown is real.
And for the comparisons, I keep the hardware the same. Even though the P40 doesn’t have TensorCores, I did see speedups with APEX O2 in two of the three configurations above. Only in the one where I installed PyTorch with pip install torch==1.6.0+cu101 is the speedup gone.


Multiple iterations are a proper way to stabilize the results; however, synchronizations are still needed.