Question: How can I reduce the latency (computational cost) of ConvTranspose2d?
I compared the latency of nn.Conv2d, Upsample(mode='nearest'), ConvTranspose2d, and Upsample(mode='bilinear') at different batch sizes. It looks like the deconvolution operation takes a disproportionately long time at a large batch size (= 64). Is there a way to reduce the computation time of deconvolution?
Thank you
Comparison Table
[------------------------------------- upsample module comparison ------------------------------------]
                   |  nearest(scale=2)  |  conv(kernel=3)  |  deconv(scale=2)  |  bilinear(scale=2)
4 threads: --------------------------------------------------------------------------------------------
  [8, 2048, 8, 8]  |        30.4        |       94.6       |       134.6      |       4838.2
  [16, 2048, 8, 8] |        58.4        |       94.3       |       249.4      |       9682.3
  [32, 2048, 8, 8] |       111.6        |      119.1       |       480.1      |      19434.2
  [64, 2048, 8, 8] |       219.9        |      206.1       |     12399.6      |      38844.8
Times are in microseconds (us).
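For context, the deconv column above is a depthwise ConvTranspose2d with stride 2. A minimal sketch of one replacement I am considering (a nearest-neighbor upsample followed by a depthwise 3x3 conv — my own assumption, not something from the benchmark above) that produces the same output shape:

```python
import torch
import torch.nn as nn

ch, scale = 2048, 2
x = torch.rand(8, ch, 8, 8)

# Depthwise transposed conv, as benchmarked above: [8, 2048, 8, 8] -> [8, 2048, 16, 16].
deconv = nn.ConvTranspose2d(ch, ch, scale, stride=scale, groups=ch, bias=False)

# Candidate replacement: nearest upsample, then a depthwise 3x3 conv (padding keeps the size).
alt = nn.Sequential(
    nn.Upsample(scale_factor=scale, mode='nearest'),
    nn.Conv2d(ch, ch, 3, stride=1, padding=1, groups=ch, bias=False),
)

# Both branches produce the same output shape, so they are drop-in replacements shape-wise.
assert deconv(x).shape == alt(x).shape == (8, ch, 16, 16)
```

Whether this is actually faster (or acceptable accuracy-wise) is exactly what I would like to verify.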
Comparison Code
import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark
from itertools import product


def get_upsample_module(mode, upsample, ch):
    if mode == 'deconv':
        return nn.Sequential(
            nn.ConvTranspose2d(ch, ch, upsample, stride=upsample, dilation=1, groups=ch, bias=False),
            nn.BatchNorm2d(ch)
        ).cuda()
    else:
        return nn.Upsample(scale_factor=upsample, mode=mode).cuda()


def get_downsample_module(kernel, ch):
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel, stride=1, padding=1, dilation=1, groups=ch, bias=False),
        nn.BatchNorm2d(ch)
    ).cuda()


results = []
batch_sizes = [8, 16, 32, 64]
channel_sizes = [2048]
image_sizes = [8]
modes = ['nearest', 'conv', 'deconv', 'bilinear']
scale_factors = [2]

for b, c, n in product(batch_sizes, channel_sizes, image_sizes):
    label = 'upsample module comparison'
    sub_label = f'[{b}, {c}, {n}, {n}]'
    x = torch.rand((b, c, n, n)).cuda()
    for method, scale in product(modes, scale_factors):
        if method == 'conv':
            # The conv baseline keeps the spatial size; kernel = scale + 1 gives a 3x3 conv.
            kernel = scale + 1
            model = get_downsample_module(kernel, c)
            description = f'{method}(kernel={kernel})'
        else:
            model = get_upsample_module(method, scale, c)
            description = f'{method}(scale={scale})'
        with torch.cuda.amp.autocast():
            results.append(benchmark.Timer(
                stmt='model(x)',
                # Pass both model and x through globals instead of importing from __main__.
                globals={'model': model, 'x': x},
                label=label,
                sub_label=sub_label,
                description=description,
                num_threads=4
            ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.colorize()
compare.print()
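One knob I plan to try before anything structural (an assumption on my part, not something I have measured yet) is cuDNN autotuning, which lets cuDNN benchmark its available algorithms per input shape and can speed up (transposed) convolutions when the shapes are fixed across iterations, as they are in this benchmark:

```python
import torch

# Ask cuDNN to benchmark its algorithms for each new input shape and cache the fastest one.
# Helps when input shapes do not change between iterations; adds overhead on the first call.
torch.backends.cudnn.benchmark = True
```

I am not sure how much this helps for depthwise ConvTranspose2d specifically, which is part of my question.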
My Environment:
- OS: Ubuntu 18.04
- Python: 3.7.11
- PyTorch: 1.10.0
- CUDA: 11.3