I have moved my model from a V100 to an A100 and, instead of a speedup, I am seeing a significant slowdown: from 14.2 it/sec down to 10.06 it/sec.
CUDA version: 11.3
PyTorch version: 1.9.0+cu111
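For reference, this is the quick sanity check I run first (diagnostic only, not part of the repository code): it prints which compute capabilities the installed wheel was built for and what the GPU reports. The cu111 wheels should already include sm_80 for the A100, so this is mainly to rule out a build mismatch.

import torch

print(torch.__version__)                     # e.g. 1.9.0+cu111
print(torch.version.cuda)                    # CUDA runtime the wheel was built against
print(torch.cuda.get_arch_list())            # compiled compute capabilities; A100 needs sm_80
print(torch.cuda.get_device_capability(0))   # should report (8, 0) on an A100
print(torch.backends.cudnn.version())        # cuDNN version in use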
I am specifically using the code from the NLSPN GitHub repository.
The repository has an apex dependency, which I first suspected to be the issue, but removing it and training on a single GPU still shows the same slowdown.
The repository depends on Deformable-Convolution-V2-PyTorch, which seems to have been written ~3 years ago. Are you also seeing a slowdown without these custom layers or did you profile the model to see which operations are the bottleneck?
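Something along these lines would do, using torch.profiler (available in 1.9.0). This is just a sketch: net, loader_train and gpu are the objects from your training script, and the number of profiled steps is arbitrary.

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a handful of training iterations to see which ops dominate on the A100.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for i, sample in enumerate(loader_train):
        sample = {k: v.cuda(gpu) for k, v in sample.items() if v is not None}
        output = net(sample)
        if i == 10:
            break

# Sort by total CUDA time to spot the bottleneck (e.g. the custom deformable conv).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

Sorting by CUDA time should make it obvious whether the deformable-convolution kernels or something else dominates on the new card.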
dist.init_process_group(backend='nccl', init_method='env://',
                        world_size=args.num_gpus, rank=gpu)
torch.cuda.set_device(gpu)

# Prepare dataset
data = get_data(args)

data_train = data(args, 'train')
data_val = data(args, 'val')

sampler_train = DistributedSampler(
    data_train, num_replicas=args.num_gpus, rank=gpu)
sampler_val = DistributedSampler(
    data_val, num_replicas=args.num_gpus, rank=gpu)

# Per-GPU batch size
batch_size = args.batch_size // args.num_gpus

loader_train = DataLoader(
    dataset=data_train, batch_size=batch_size, shuffle=False,
    num_workers=args.num_threads, pin_memory=True, sampler=sampler_train,
    drop_last=True)
loader_val = DataLoader(
    dataset=data_val, batch_size=1, shuffle=False,
    num_workers=args.num_threads, pin_memory=True, sampler=sampler_val,
    drop_last=False)

# Network
model = get_model(args)
net = model(args)
net.cuda(gpu)

if gpu == 0:
    if args.pretrain is not None:
        assert os.path.exists(args.pretrain), \
            "file not found: {}".format(args.pretrain)

        checkpoint = torch.load(args.pretrain)
        net.load_state_dict(checkpoint['net'])

        print('Load network parameters from : {}'.format(args.pretrain))

# Loss
loss = get_loss(args)
loss = loss(args)
loss.cuda(gpu)

# Optimizer
optimizer, scheduler = utility.make_optimizer_scheduler(args, net)

# Apex sync-BN, mixed precision and distributed data parallel
net = apex.parallel.convert_syncbn_model(net)
net, optimizer = amp.initialize(net, optimizer, opt_level=args.opt_level,
                                verbosity=0)
net = DDP(net)

for epoch in range(1, args.epochs + 1):
    for batch, sample in enumerate(loader_train):
        sample = {key: val.cuda(gpu) for key, val in sample.items()
                  if val is not None}

        # Linear learning-rate warm-up during the first epoch
        if epoch == 1 and args.warm_up:
            warm_up_cnt += 1

            for param_group in optimizer.param_groups:
                lr_warm_up = param_group['initial_lr'] \
                    * warm_up_cnt / warm_up_max_cnt
                param_group['lr'] = lr_warm_up

        optimizer.zero_grad()
        output = net(sample)
The code above is taken from src/main.py in the NLSPN repository. There are two inputs: RGB of shape torch.Size([24, 3, 224, 304]) and LiDAR of shape torch.Size([24, 1, 224, 304]). I have also tried removing the apex dependency entirely; that runs fine but makes no difference to the slowdown.
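For completeness, this is the kind of standalone forward-pass timing I would use to compare the two cards independently of the data loader. It is only a sketch: net is the NLSPN model already on the GPU, the tensor shapes match the ones above, and the dictionary keys are an assumption that needs to match whatever the NLSPN loader actually yields.

import time
import torch

# Dummy inputs with the same shapes as the real batch; key names are assumed.
sample = {'rgb': torch.randn(24, 3, 224, 304, device='cuda'),
          'dep': torch.randn(24, 1, 224, 304, device='cuda')}

with torch.no_grad():
    # Warm-up iterations so cuDNN autotuning / lazy initialization do not pollute the timing.
    for _ in range(10):
        _ = net(sample)
    torch.cuda.synchronize()

    # Synchronize around the timed region so we measure GPU work, not just kernel launches.
    t0 = time.time()
    for _ in range(100):
        _ = net(sample)
    torch.cuda.synchronize()

print('forward it/sec: {:.2f}'.format(100 / (time.time() - t0)))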