FP16 (AMP) training slowdown with PyTorch 1.6.0

Liyang_Lu · September 17, 2020, 8:23pm

Hi,

I’m seeing unexpectedly slow training with PyTorch 1.6.0 + AMP.

I built two docker images whose only difference is that one has torch 1.5.0+cu101 and the other has torch 1.6.0+cu101. On both images, I ran the same code (Huggingface xlmr-base model for token classification) on the same hardware (P40 GPU), with no distributed data parallel or gradient accumulation. The table below summarizes the training speeds I got:

samples/s	PyTorch 1.5.0	PyTorch 1.6.0	diff
FP32	51.97	51.57	-0.4
FP16 with apex.amp O1	51.25	47.43	-3.82
FP16 with apex.amp O2	56.68	49.09	-7.59
FP16 with torch.cuda.amp	N/A	47.17	N/A

Between 1.5.0 and 1.6.0, the FP32 speeds are close, but APEX AMP O1 and O2 are both significantly slower with 1.6.0. PyTorch 1.6.0 native AMP is also much slower than 1.5.0 + apex.amp. All three FP16 AMP configurations with 1.6.0 are slower than FP32.
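
To put the absolute differences in perspective, here is a small sketch that converts the 1.5.0 and 1.6.0 columns of the table above into relative changes. The numbers are copied from the table; the helper name `relative_change` is only illustrative.

```python
# Throughput in samples/s, copied from the table: (PyTorch 1.5.0, PyTorch 1.6.0).
throughput = {
    "FP32": (51.97, 51.57),
    "FP16 with apex.amp O1": (51.25, 47.43),
    "FP16 with apex.amp O2": (56.68, 49.09),
}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


for name, (v150, v160) in throughput.items():
    print(f"{name}: {relative_change(v150, v160):+.1f}%")
```

Seen this way, the FP32 change is under 1%, while the apex.amp O2 row drops by roughly 13%.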

Again, the only difference is the PyTorch version in the docker images. Everything else is the same in both: cuda 10.1, cudnn 7.6.5.32-1+cuda10.1, python 3.6.8.

Do you have any suggestions on this problem?

Edit:

Upon further investigation, I built a docker image by extending pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel and ran the same experiments on it. With this image, the 1.6.0 results are comparable to the 1.5.0 ones:

samples/s	Image A, PyTorch 1.5.0	Image A, PyTorch 1.6.0	Image B, PyTorch 1.6.0
FP32	51.97	51.57	51.44
FP16 with apex.amp O1	51.25	47.43	50.57
FP16 with apex.amp O2	56.68	49.09	55.98
FP16 with torch.cuda.amp	N/A	47.17	48.26

In docker image A, I have:

FROM nvidia/cuda:10.1-devel-ubuntu18.04

ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
# or 1.5.0+cu101
ENV PYTORCH_VERSION=1.6.0+cu101
# or 0.6.0+cu101
ENV TORCHVISION_VERSION=0.7.0+cu101
...
...
RUN apt-get update && \
    apt-get install -y --allow-change-held-packages --allow-downgrades --no-install-recommends \
    libcudnn7=${CUDNN_VERSION} \
    libcudnn7-dev=${CUDNN_VERSION}

RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} -f https://download.pytorch.org/whl/torch_stable.html
...
...

In docker image B, I have:

FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
...
...

So my new question is: what in docker image A could cause the FP16 speed regression from 1.5.0 to 1.6.0? Am I installing PyTorch correctly in it?
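
One way to compare the two images is to print what the imported torch build itself reports, since the pip wheels ship their own CUDA and cuDNN libraries and so may not use the libcudnn7 installed through apt. The following is only a sketch; the function name `describe_torch_build` is made up for illustration, and it returns a stub if torch is not installed.

```python
def describe_torch_build() -> dict:
    """Collect version info for the PyTorch build that Python actually imports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    cuda_ok = torch.cuda.is_available()
    return {
        "torch": torch.__version__,
        "cuda (build)": torch.version.cuda,          # CUDA version torch was built with
        "cudnn": torch.backends.cudnn.version(),     # cuDNN version torch loads, or None
        "cuda available": cuda_ok,
        "device": torch.cuda.get_device_name(0) if cuda_ok else None,
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Running this in image A and image B and diffing the output should show whether the two 1.6.0 installs really use the same cuDNN. `python -m torch.utils.collect_env` gives a more complete report.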

Thank you

BramVanroy · September 17, 2020, 8:34pm

Just to be sure that you did a fair benchmark: did you set all seeds, enable cudnn deterministic mode, and disable its benchmark mode? Different seeds can mean differently shuffled batches, which might be more efficient in some cases than in others (depending on how the padding is done). Something like this:

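import os
import random
from typing import Optional

import numpy as np
import torch


def set_seed(seed: Optional[int]):
    """Set all seeds to make results reproducible (deterministic mode).
       When seed is None, nothing is changed.
    :param seed: an integer of your choosing, or None
    """
    if seed is not None:
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Force deterministic cuDNN kernels and disable the autotuner.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        np.random.seed(seed)
        random.seed(seed)
        # Only affects subprocesses; the running interpreter's hashing is already fixed.
        os.environ['PYTHONHASHSEED'] = str(seed)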

Liyang_Lu · September 17, 2020, 8:52pm

Thank you for the quick reply. Yes, the seeds are all set. The loss curves overlap, so the batches are deterministic.
I also repeated the experiments multiple times with different model sizes and data, and all of them show the same pattern. The gap is just too big to be the result of randomness.

Liyang_Lu · September 18, 2020, 6:53pm

New updates added to the original post.

ptrblck · September 21, 2020, 8:02am

Did you synchronize the code properly before starting and stopping the timer?
If not, this would yield invalid profiling results, since CUDA operations are asynchronously executed and your Python script might “run ahead”.
Also note that you shouldn’t expect speedups on the P40, as no TensorCores are available on this device.
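
A minimal sketch of what synchronized timing could look like. It assumes `step` is a zero-argument callable that runs one training update; `step` and `measure_throughput` are hypothetical names, not part of the original code. The torch import is guarded so the sketch also runs on a machine without PyTorch or CUDA, where the synchronization becomes a no-op.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def _sync() -> None:
    """Block until all queued CUDA work has finished (no-op without CUDA)."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()


def measure_throughput(step, batch_size: int, warmup: int = 10, iters: int = 50) -> float:
    """Return samples/s for `step`, a hypothetical zero-argument training step."""
    for _ in range(warmup):
        step()
    _sync()  # drain warm-up work before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        step()
    _sync()  # wait for the timed work before stopping the timer
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed
```

Without the two `_sync()` calls, the timer can start while warm-up kernels are still running and stop before the last timed kernels have finished.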

Liyang_Lu · September 21, 2020, 9:06pm

Yes, I understand. The throughput is calculated over many updates, so the slowdown is real.
For the comparisons, I kept the hardware the same. Even though the P40 doesn’t have TensorCores, I did see speedups with APEX O2 in two of the three configurations above. In the third one, where I installed PyTorch with pip install torch==1.6.0+cu101, the speedup is gone.

ptrblck · September 22, 2020, 4:41am

Multiple iterations are a proper way to stabilize the results; however, synchronizations are still needed.