pytorch 1.7.0 is much slower than pytorch 1.3.1

suuankotanki · November 17, 2020, 10:13am

Hello, I recently switched my GPU from a 2080Ti to a 3090 and created a new conda environment with pytorch 1.7.0. But I found that pytorch 1.7.0 takes longer to train, especially when I enable nvidia-apex or torch.cuda.amp, where it was up to 6x slower!
(2080Ti pytorch1.3.1 with nvidia-apex: 1.6it/s
3090 pytorch1.7.0 with torch.cuda.amp: 5.35s/it)
This was so strange that I ran a number of experiments to find out where the problem was.

(By the way, I used my old 2080Ti for these experiments, because the 3090 can’t run pytorch 1.3.1.)

First I created two new conda environments, pyt1 and pyt2:
“conda create -n pyt1 python=3.6.9”
“conda create -n pyt2 python=3.6.9”
Then in pyt1 I installed pytorch 1.7.0 with “conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch” and “conda install tqdm”.
This installed cudatoolkit 11.0.221 and pytorch 1.7.0.
After that I took the official pytorch MNIST example code from the examples repository on GitHub.
The only code I added was tqdm, to show a progress bar.
I ran it with “(pyt1)XXXXX:~$python main.py”.
The result was:
[screenshot: 2020-11-17 16-49-23]

Then in pyt2 I installed pytorch 1.3.1 with “conda install pytorch torchvision” and “conda install tqdm”.
This installed the default pytorch 1.3.1 and cudatoolkit 10.0.130.
This time the result was:
[screenshot: 2020-11-17 16-58-11]
It was clear that pytorch 1.7.0 was about 1/3 slower than pytorch 1.3.1.
My PC environment:
nvidia driver 450.66
CUDA 11.0
But a conda environment will use its own cudatoolkit, right? I don’t know whether the system-wide installation affects the environment inside conda.
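
One way to check which CUDA and cuDNN versions an environment actually uses is to ask PyTorch itself. The sketch below is only an illustration: the helper name `describe_env` is made up for this example, and the `try`/`except` around the import is there so the snippet also runs where PyTorch is not installed. As far as I know, the conda binaries bundle their own CUDA runtime and cuDNN and only rely on the host for the NVIDIA driver, so the values printed here should reflect the conda packages rather than the system-wide CUDA.

```python
try:
    import torch
except ImportError:  # lets the snippet run even where PyTorch is missing
    torch = None


def describe_env():
    """Return the versions PyTorch reports for the active environment.

    `describe_env` is a hypothetical helper name, not a PyTorch API.
    """
    if torch is None:
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        # CUDA version the installed binary was built against
        "cuda_built_with": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is unavailable
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

Running this once in pyt1 and once in pyt2 would show whether the two environments really differ in the way the install logs suggest.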

ptrblck · November 17, 2020, 10:14am

The PyTorch 1.7.0 binaries with CUDA11.0 use cudnn8.0.3, which doesn’t ship with trained heuristics for the RTX3090. We are targeting cudnn8.0.5 for 1.7.1, which should improve this situation.

suuankotanki · November 17, 2020, 11:34am

Thanks for your reply, but the results of the above tests are all from the 2080Ti, not the 3090, which is what I find most strange.

ptrblck · November 17, 2020, 11:45am

This might be an unrelated regression on the 2080Ti.
Could you post the model and shapes you are using?
Also, did you use torch.backends.cudnn.benchmark = True, which would use the heuristics to select the fastest kernel for your workload?
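
For reference, a minimal sketch of how the flag is typically set. According to the PyTorch documentation, benchmark mode makes cuDNN try several convolution algorithms for each new input shape and keep the fastest, so it mainly helps when input shapes stay fixed. The layer and batch shape below are copied from `conv1` of the MNIST example; everything else is an assumption made for illustration, and on a CPU-only machine the flag simply has no effect.

```python
import torch
import torch.nn as nn

# Ask cuDNN to benchmark candidate convolution algorithms per input shape.
# Set this before the first forward pass so that pass is covered too.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Same layer shape as conv1 in the MNIST example, default batch size of 64.
conv = nn.Conv2d(1, 32, 3, 1).to(device)
x = torch.randn(64, 1, 28, 28, device=device)

with torch.no_grad():
    # The first iterations include the algorithm search on CUDA,
    # so they should not be counted when timing.
    for _ in range(3):
        out = conv(x)

print(out.shape)
```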

suuankotanki · November 17, 2020, 12:26pm

I just used the official pytorch “Basic MNIST Example” from pytorch’s GitHub for this test: https://github.com/pytorch/examples/tree/master/mnist
I tried setting torch.backends.cudnn.benchmark = True, but it made no difference.

I simply downloaded this code, added a line to show the progress bar and ran it under pytorch 1.7.0 and pytorch 1.3.1, but got very different results:

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
from tqdm import tqdm
from apex import amp  # NVIDIA Apex; only used when --enable_apex is passed


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in tqdm(enumerate(train_loader)):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        if args.enable_apex:
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in tqdm(test_loader):
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    parser.add_argument('--enable_apex', action='store_true', default=False,
                        help='Enable Nvidia-Apex')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    train_kwargs = {'batch_size': args.batch_size}
    test_kwargs = {'batch_size': args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('data', train=True, download=True,
                       transform=transform)
    dataset2 = datasets.MNIST('data', train=False,
                       transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **test_kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)

    if args.enable_apex:
        model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()
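
The script above only covers the nvidia-apex path. For the torch.cuda.amp path mentioned in the first post, a training step would look roughly like the sketch below. This is not the code that was benchmarked in this thread: the tiny `nn.Linear` model and the random batch are hypothetical stand-ins for `Net()` and the MNIST loader, and `enabled=use_cuda` is added so that the snippet falls back to plain FP32 on a machine without a GPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical stand-in for Net(); any nn.Module works the same way.
model = nn.Linear(16, 10).to(device)
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)

# GradScaler plays the role of amp.scale_loss in the apex version.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)


def train_step(data, target):
    optimizer.zero_grad()
    # Forward pass and loss run under autocast (mixed precision on CUDA).
    with torch.cuda.amp.autocast(enabled=use_cuda):
        output = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(output, target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()


# Random batch standing in for one MNIST batch.
data = torch.randn(64, 16, device=device)
target = torch.randint(0, 10, (64,), device=device)
print(train_step(data, target))
```

Unlike apex, this path needs no `amp.initialize` call; the model and optimizer are used as they are.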

ptrblck · November 17, 2020, 10:59pm

Thanks for this information. Could you compare the script with PyTorch 1.7.0 and CUDA10.2, which uses cudnn7.6.5? I still think your 2080Ti is seeing a regression on cudnn8.0.3.
I can run the workload on my system later with cudnn8.0.5 and the latest internal version.

suuankotanki · November 18, 2020, 3:25pm

Sure, I compared different versions of pytorch, cuda and cudnn. The results are:

pytorch1.7.0, cuda11.0, cudnn 8.0.3
without apex: [112it/s]
with apex: [82it/s]

pytorch1.7.0, cuda10.2, cudnn 7.6.5
without apex: [110it/s]
with apex: [82it/s]

pytorch1.3.1, cuda10.0.1, cudnn 7.6.5
without apex: [147it/s]
with apex: [116it/s]

pytorch1.4.0, cuda10.0.1, cudnn 7.6.3
without apex: [141it/s]
with apex: [113it/s]

I’m not a native English speaker, so I don’t quite understand what “regression” means. But from my point of view, the pytorch version makes a big difference in performance. I hope you can help me figure it out.
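
To put the numbers above on a common scale, the throughput of each configuration can be expressed as a percentage drop relative to the pytorch 1.3.1 run. The short script below only redoes that arithmetic on the it/s values quoted in this post; the dictionary labels and the `slowdown_percent` helper are names chosen for this example.

```python
# Throughput in it/s as reported above: (without apex, with apex).
results = {
    "1.7.0 / cuda11.0 / cudnn8.0.3": (112, 82),
    "1.7.0 / cuda10.2 / cudnn7.6.5": (110, 82),
    "1.4.0 / cuda10.0.1 / cudnn7.6.3": (141, 113),
}
baseline = (147, 116)  # pytorch 1.3.1 / cuda10.0.1 / cudnn7.6.5


def slowdown_percent(rate, reference):
    """Percentage drop in throughput of `rate` relative to `reference`."""
    return (1.0 - rate / reference) * 100.0


for label, (plain, apex) in results.items():
    print(f"{label}: "
          f"{slowdown_percent(plain, baseline[0]):.1f}% slower without apex, "
          f"{slowdown_percent(apex, baseline[1]):.1f}% slower with apex")
```

On these figures both 1.7.0 runs come out roughly a quarter slower than 1.3.1 regardless of the CUDA/cudnn pairing, while 1.4.0 stays within a few percent of it.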

hinken · November 18, 2020, 4:29pm

Regression means an unwanted drop in performance, or new bugs discovered in old, well-tested and previously bug-free code, after updating something.

So ptrblck was trying to say that a later pytorch update could accidentally have made something worse for the 2080Ti, which would otherwise be considered an old and well-tested platform for PyTorch.

suuankotanki · November 19, 2020, 1:46am

Oh, I finally get it. Thank you very much!

ptrblck · November 19, 2020, 7:56am

Thanks for the update. @hinken is right about what I meant.

I’ll try to reproduce the slowdown, as it doesn’t seem to come from the CUDA/cudnn update.

huiwong · March 6, 2021, 3:02am

Hello, I have the same problem. Did you figure out the cause?
I used torch 1.6 before; after updating to 1.7, the run time became much slower (0.04s vs 0.14s for a single net).

ptrblck · March 6, 2021, 6:17am

For the previous slowdown using the same 3rd party libs, the framework overhead might play a role in it.

For your issue between 1.6 vs. 1.7 we would need more information. I.e. which CUDA, cudnn versions are you using? Were any of these changed? What kind of device and model etc.?

huiwong · March 7, 2021, 6:20am

Thanks for your reply. I think my computer’s performance may just not be stable. It is back to normal now.
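
Since a single measurement can be thrown off by an unstable machine, one way to compare versions more reliably is to discard a few warm-up runs and report the median of repeated timings. The helper below is a generic sketch, not something used in this thread: `time_call` and its parameters are hypothetical names, and the `sync` argument is where `torch.cuda.synchronize` would be passed when timing GPU work, because CUDA kernels run asynchronously.

```python
import statistics
import time


def time_call(fn, warmup=5, repeats=20, sync=None):
    """Time `fn` several times and return (median, min, max) in seconds.

    `sync` is an optional callable that blocks until pending work is done,
    e.g. torch.cuda.synchronize when `fn` launches CUDA kernels.
    """
    for _ in range(warmup):  # warm-up runs are not measured
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)


# CPU-only usage example with a trivial workload.
median_s, min_s, max_s = time_call(lambda: sum(range(10_000)))
print(f"median {median_s:.6f}s  min {min_s:.6f}s  max {max_s:.6f}s")
```

A large gap between the minimum and maximum is a hint that the machine, rather than the library version, is responsible for the difference.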