Hi,
I’m seeing unexpectedly slow training with PyTorch 1.6.0 + AMP.
I built two Docker images whose only difference is that one has torch 1.5.0+cu101 and the other has torch 1.6.0+cu101. In both images I ran the same code (a Hugging Face xlmr-base model for token classification) on the same hardware (a P40 GPU), with no distributed data parallel or gradient accumulation.
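For context, throughput is measured with a simple wall-clock harness along these lines (an illustrative sketch only; the batch size, sequence length, step count, and label count here are placeholders, not my exact settings):

```python
import time
import torch
from transformers import XLMRobertaForTokenClassification

# Illustrative FP32 throughput harness; batch size, sequence length, and
# step count are placeholders, not the exact settings from my runs.
model = XLMRobertaForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=9).cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

batch_size, seq_len, steps = 32, 128, 100
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch_size, seq_len), device="cuda")
labels = torch.randint(0, 9, (batch_size, seq_len), device="cuda")

torch.cuda.synchronize()
start = time.time()
for _ in range(steps):
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels)[0]
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"{batch_size * steps / (time.time() - start):.2f} samples/s")
```

The table below summarizes the training speed I got: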
| samples/s | PyTorch 1.5.0 | PyTorch 1.6.0 | diff |
|---|---|---|---|
| FP32 | 51.97 | 51.57 | -0.40 |
| FP16 with apex.amp O1 | 51.25 | 47.43 | -3.82 |
| FP16 with apex.amp O2 | 56.68 | 49.09 | -7.59 |
| FP16 with torch.cuda.amp | N/A | 47.17 | N/A |
Comparing 1.5.0 and 1.6.0, the FP32 speeds are close, but both apex.amp O1 and O2 are significantly slower under 1.6.0. PyTorch 1.6.0’s native AMP is also much slower than 1.5.0 + apex.amp. In fact, all three FP16/AMP configurations under 1.6.0 are slower than FP32.
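For reference, the torch.cuda.amp runs follow the standard autocast + GradScaler recipe from the 1.6.0 docs; a minimal sketch (the linear model and random data are placeholders standing in for the real XLM-R training step):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Minimal native-AMP sketch (PyTorch 1.6.0); model and data are placeholders.
model = torch.nn.Linear(768, 9).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scaler = GradScaler()

for _ in range(100):
    x = torch.randn(32, 768, device="cuda")
    y = torch.randint(0, 9, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():                     # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()        # scale loss to avoid FP16 underflow
    scaler.step(optimizer)               # unscales grads; skips step on inf/NaN
    scaler.update()                      # adjust the scale factor
```

The apex.amp runs use the usual amp.initialize(model, optimizer, opt_level="O1") path (or "O2"), with amp.scale_loss wrapping the backward pass.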
Again, the only difference is the PyTorch version in the two Docker images. Everything else is identical: CUDA 10.1, cuDNN 7.6.5.32-1+cuda10.1, Python 3.6.8.
Do you have any suggestions about what could cause this?
Edit:
Upon further investigation, I built a Docker image extending pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel and ran the same experiments in it. With this image, 1.6.0 now gives comparable results:
| samples/s | image A (PyTorch 1.5.0) | image A (PyTorch 1.6.0) | image B (PyTorch 1.6.0) |
|---|---|---|---|
| FP32 | 51.97 | 51.57 | 51.44 |
| FP16 with apex.amp O1 | 51.25 | 47.43 | 50.57 |
| FP16 with apex.amp O2 | 56.68 | 49.09 | 55.98 |
| FP16 with torch.cuda.amp | N/A | 47.17 | 48.26 |
In Docker image A, I have:

```dockerfile
FROM nvidia/cuda:10.1-devel-ubuntu18.04

ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
# 1.5.0+cu101 in the 1.5.0 image
ENV PYTORCH_VERSION=1.6.0+cu101
# 0.6.0+cu101 in the 1.5.0 image
ENV TORCHVISION_VERSION=0.7.0+cu101
...
RUN apt-get update && \
    apt-get install -y --allow-change-held-packages --allow-downgrades --no-install-recommends \
        libcudnn7=${CUDNN_VERSION} \
        libcudnn7-dev=${CUDNN_VERSION}
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} -f https://download.pytorch.org/whl/torch_stable.html
...
```
In Docker image B, I have:

```dockerfile
FROM pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
...
```
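To sanity-check what each image actually runs, the following can be executed in both containers (plain PyTorch introspection calls, nothing image-specific assumed):

```python
import torch

# Confirm which wheel, CUDA build, cuDNN, and GPU are actually in use.
print(torch.__version__)                 # expect 1.5.0+cu101 / 1.6.0+cu101
print(torch.version.cuda)                # CUDA version the wheel was built with
print(torch.backends.cudnn.version())    # cuDNN actually loaded at runtime
print(torch.cuda.get_device_name(0))     # should report the P40
```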
So my new question is: what could be wrong with Docker image A that causes the FP16 speed regression from 1.5.0 to 1.6.0? Am I installing PyTorch correctly in it?
Thank you