A100 training slower than V100

I have moved my model from a V100 to an A100 and, instead of an increase in speed, there has been a significant slowdown from 14.2 it/s to 10.06 it/s.
CUDA version: 11.3
PyTorch version: 1.9.0+cu111
I am specifically using the code from the NLSPN GitHub repository.
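As a first sanity check, this is a quick way to confirm from within PyTorch which build and device are actually being used (standard torch calls only, nothing specific to NLSPN):

import torch

# What the installed wheel was built with and which device it actually sees.
print(torch.__version__)                    # e.g. 1.9.0+cu111
print(torch.version.cuda)                   # CUDA runtime the wheel ships with
print(torch.backends.cudnn.version())       # bundled cuDNN version
print(torch.cuda.get_device_name(0))        # should report the A100
print(torch.cuda.get_device_capability(0))  # (8, 0) for an A100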

There is an apex dependency in the repository which I thought might be the issue, but removing it and training on a single GPU still shows the same slowdown.

The repository depends on Deformable-Convolution-V2-PyTorch, which seems to have been written ~3 years ago. Are you also seeing a slowdown without these custom layers, or did you profile the model to see which operations are the bottleneck?
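For the profiling, a minimal sketch along these lines would already help narrowing it down (net, sample, optimizer, and compute_loss are placeholders for your own objects, not NLSPN code):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a handful of training iterations to see which ops dominate on the GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        optimizer.zero_grad()
        output = net(sample)               # forward pass
        loss_val = compute_loss(output)    # placeholder for your loss computation
        loss_val.backward()
        optimizer.step()

# Sort by total CUDA time to find the bottleneck kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))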

Even after removing the deformable convolution layers there is a speed drop, from 16.31 it/s to 11.99 it/s.

Thanks for the update! Could you post the code to initialize the model as well as the input shapes you are using, please?

    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=args.num_gpus, rank=gpu)
    torch.cuda.set_device(gpu)

    # Prepare dataset
    data = get_data(args)

    data_train = data(args, 'train')
    data_val = data(args, 'val')

    sampler_train = DistributedSampler(
        data_train, num_replicas=args.num_gpus, rank=gpu)
    sampler_val = DistributedSampler(
        data_val, num_replicas=args.num_gpus, rank=gpu)

    batch_size = args.batch_size // args.num_gpus

    loader_train = DataLoader(
        dataset=data_train, batch_size=batch_size, shuffle=False,
        num_workers=args.num_threads, pin_memory=True, sampler=sampler_train,
        drop_last=True)
    loader_val = DataLoader(
        dataset=data_val, batch_size=1, shuffle=False,
        num_workers=args.num_threads, pin_memory=True, sampler=sampler_val,
        drop_last=False)

    # Network
    model = get_model(args)
    net = model(args)
    net.cuda(gpu)

    if gpu == 0:
        if args.pretrain is not None:
            assert os.path.exists(args.pretrain), \
                "file not found: {}".format(args.pretrain)

            checkpoint = torch.load(args.pretrain)
            net.load_state_dict(checkpoint['net'])

            print('Load network parameters from : {}'.format(args.pretrain))

    # Loss
    loss = get_loss(args)
    loss = loss(args)
    loss.cuda(gpu)

    # Optimizer
    optimizer, scheduler = utility.make_optimizer_scheduler(args, net)

    # apex: SyncBN conversion, mixed-precision initialization and DDP wrapping
    net = apex.parallel.convert_syncbn_model(net)
    net, optimizer = amp.initialize(net, optimizer, opt_level=args.opt_level,
                                    verbosity=0)
    net = DDP(net)

    for epoch in range(1, args.epochs+1):
        for batch, sample in enumerate(loader_train):
            sample = {key: val.cuda(gpu) for key, val in sample.items()
                      if val is not None}

            if epoch == 1 and args.warm_up:
                warm_up_cnt += 1

                for param_group in optimizer.param_groups:
                    lr_warm_up = param_group['initial_lr'] \
                                 * warm_up_cnt / warm_up_max_cnt
                    param_group['lr'] = lr_warm_up

            optimizer.zero_grad()

            output = net(sample)

The code is taken from src/main.py in the NLSPN repository. There are two inputs: RGB with shape torch.Size([24, 3, 224, 304]) and LiDAR with shape torch.Size([24, 1, 224, 304]). I have also tried removing the apex dependency; that makes no difference to the slowdown.
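In case it helps to reproduce the timing without the dataset, the inputs can be replaced by random tensors of the same shapes. The sample keys and the dummy loss below are only placeholders, not the real NLSPN code:

import time
import torch

# Fake inputs matching the shapes from the real loader; the key names are my
# guess at what the sample dict uses.
sample = {
    'rgb': torch.randn(24, 3, 224, 304, device='cuda'),
    'dep': torch.randn(24, 1, 224, 304, device='cuda'),
}

net.train()
torch.cuda.synchronize()
t0 = time.time()
iters = 50
for _ in range(iters):
    optimizer.zero_grad()
    output = net(sample)
    # Placeholder: reduce the output to a scalar just to get a backward pass.
    out = output['pred'] if isinstance(output, dict) else output  # 'pred' key is a guess
    out.mean().backward()
    optimizer.step()
torch.cuda.synchronize()  # make sure all kernels finished before reading the clock
print('{:.2f} it/s'.format(iters / (time.time() - t0)))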

Any idea what the issue could be?

In the topic right below this one (Can not attain better performances after changing nvidia GPU - #8 by tjk) we found out that enabling cudnn benchmark mode can lead to better performance on an A5000.
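i.e. setting this once, before the model and loaders are built:

import torch

# Let cuDNN benchmark the available conv algorithms for the fixed input shapes
# and cache the fastest one; helps when input sizes do not change between steps.
torch.backends.cudnn.benchmark = True

# On Ampere it may also be worth confirming the TF32 flags (to my knowledge both
# default to True in 1.9, so this is just a check, not a required change):
print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)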

Did you find out what the issue is? I also encountered this problem.

No, @my3bikaht, it did not work. @zimian_wei, I did not find any solution.

I found that moving the data to an SSD helps.
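A rough way to check whether the loader rather than the GPU is the bottleneck (loader_train, net and gpu as in the snippet above; the real loss/backward/step would go where the placeholder comment is):

import time
import torch

# Split the time per iteration into "waiting for the DataLoader" vs. "GPU step".
# If data_time dominates, disk speed / num_workers is the bottleneck, not the A100.
data_time, step_time, n_iters = 0.0, 0.0, 50
t_end = time.time()
for i, sample in enumerate(loader_train):
    data_time += time.time() - t_end

    t_step = time.time()
    sample = {k: v.cuda(gpu) for k, v in sample.items() if v is not None}
    output = net(sample)
    # Placeholder: loss, backward and optimizer.step() as in the real loop.
    torch.cuda.synchronize()          # wait for the GPU before reading the clock
    step_time += time.time() - t_step

    t_end = time.time()
    if i + 1 == n_iters:
        break

print('data: {:.3f}s/it   step: {:.3f}s/it'.format(data_time / n_iters, step_time / n_iters))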