Mixed precision slower than fp32 on RTX 2080 Ti

I apologize if I’ve missed something obvious here - this question relates to issues I am having timing mixed precision vs float32 computation.

I have two servers - one with PyTorch 1.5 and CUDA 10.1, and the other with PyTorch 1.6 and CUDA 11.0. As far as I know there are no PyTorch CUDA 11.0 binaries, so that PyTorch install was built against CUDA 10.1. Both servers have RTX 2080 Ti GPUs.

On both servers, I measure pure fp32 computation as being significantly faster than mixed precision, and I can't work out why.

I am aware that the mismatch between CUDA versions on the PyTorch 1.6 server is not ideal, but I'm still not sure why there should be an issue on the PyTorch 1.5 server.

My times are as follows:

PyTorch 1.5: [timing screenshot]
PyTorch 1.6: [timing screenshot]

I've attached gists below for the two scripts that I'm using to compute the times.

Torch versions: [screenshots for both servers]

Output of nvcc --version: [screenshots for both servers]

Output of nvidia-smi: [screenshot]

Based on this comment/thread, I would expect there to be a speedup on an RTX 2080 Ti - is this correct?

I experienced a similar phenomenon when running very small networks with amp.autocast. In that case, pure fp16 inference was the fastest, followed by fp32, and amp.autocast was the slowest.
Autocasting requires type conversions of the inputs and the corresponding layers before the main computation. This casting can take more time than is saved by running the main computation in lower precision. In larger networks, the gains from the lower-precision computation outweigh the cost of the type conversions.
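
As a minimal sketch of what I mean (arbitrary tiny model and shapes, untimed here for brevity):

```python
import torch
from torch.cuda.amp import autocast  # available from PyTorch 1.6

# arbitrary tiny model: with this few ops, the per-op casts that autocast
# inserts can cost more than the fp16 computation saves
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
).cuda()
x = torch.randn(32, 64, device="cuda")

with torch.no_grad():
    out_fp32 = model(x)          # pure fp32

    with autocast():             # mixed precision: casts at each op boundary
        out_amp = model(x)

    model.half()                 # pure fp16: weights/inputs cast once up front
    out_fp16 = model(x.half())
```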

In your scripts you are rightfully synchronizing before starting the timer, but no synchronization is used when you stop the timer.
Your timing might therefore be wrong; you might in fact be profiling the PyTorch overhead and the kernel launch times rather than the actual GPU execution.

Add syncs before starting and stopping the timer and execute the real workload (forward/backward) inside a loop to get more stable results.
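
Something along these lines (an untested sketch; the model, input shape, and iteration count are just placeholders for your real workload):

```python
import time
import torch
import torchvision

# placeholder model and input; swap in your actual workload
model = torchvision.models.resnet18().cuda()
data = torch.randn(64, 3, 224, 224, device="cuda")

torch.cuda.synchronize()              # wait for all previously queued work
start = time.perf_counter()

nb_iters = 100
for _ in range(nb_iters):             # time the real workload in a loop
    out = model(data)
    out.mean().backward()

torch.cuda.synchronize()              # wait for the timed kernels to finish
stop = time.perf_counter()
print('{:.3f} ms per iteration'.format((stop - start) / nb_iters * 1e3))
```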

Setting torch.backends.cudnn.benchmark = True and torch.backends.cudnn.deterministic = False might also help, as cudnn will then profile different algorithms for each new input shape and select the fastest one.
Note that this setup increases the time of the first iteration (for forward and backward) significantly, as the profiling is executed then, so you should add warmup iterations before starting the timer.
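
Combined with the timed loop above, that could look like (again just a sketch):

```python
import torch
import torchvision

# let cudnn benchmark its algorithms for the current input shapes
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False

model = torchvision.models.resnet18().cuda()   # placeholder workload
data = torch.randn(64, 3, 224, 224, device="cuda")

# warmup iterations: the first forward/backward passes trigger the cudnn
# profiling (and other one-time setup), so keep them out of the timing
for _ in range(10):
    out = model(data)
    out.mean().backward()
torch.cuda.synchronize()

# ...then run the timed loop from the previous snippet
```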

I would also recommend using the CUDA 10.2 binaries, which ship with cudnn 7.6.5.32, or the nightly CUDA 11.0 binaries, which now come with cudnn 8.0.4.30.

Thanks for the heads up @ptrblck!

Changed the two bools as suggested, added an extra sync call before stopping the timer, and added an additional backward pass before timing (so the cudnn profiling is done up front) for fp32, plus a warmup fwd/backward autocast pass. New timings with the additional sync call for the PyTorch 1.5 server:
[timing screenshot]

So the timings went down - presumably due to changing the bool values, as suggested - but they are still in the (intuitively) wrong order.

@seungjun - thanks as well for the tip. I tried with ResNet18 and ResNet50 and the timings are about even, which seems more reasonable, though it is still a bit surprising.
[timing screenshots for ResNet18 and ResNet50]
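
For anyone following along, a minimal version of this kind of test could look like the snippet below (hypothetical batch size, optimizer, and label setup; not my exact gist):

```python
import torch
import torchvision
from torch.cuda.amp import autocast, GradScaler

model = torchvision.models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

data = torch.randn(32, 3, 224, 224, device="cuda")      # hypothetical batch
target = torch.randint(0, 1000, (32,), device="cuda")   # random labels

for _ in range(10):                  # warmup + timed iterations as above
    optimizer.zero_grad()
    with autocast():                 # mixed-precision forward pass
        loss = criterion(model(data), target)
    scaler.scale(loss).backward()    # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
torch.cuda.synchronize()
```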

The 1.6 server is currently occupied; I will be able to update tomorrow. Unfortunately, I'm using a cloud provider that only offers certain Docker images - in this case:

  • pytorch:1.5.0-cuda10.1-cudnn7-devel
  • pytorch:1.6.0-cuda10.1-cudnn7-devel

and the two builds @ptrblck mentioned aren't available :(.

The CUDA 10.2 binaries are available here; you only have to select that particular version in the install selector.
The CUDA 11.0 nightly binaries can be installed by specifying cudatoolkit=11.0 in the conda install command.