While training a 3D CNN with Apex amp, I found a very weird result (accuracy drops by around 6%), specifically with PyTorch 1.4.0, a V100 GPU, and amp opt-levels O1 and O2.
On PyTorch 1.1.0, it worked correctly on both P40 and V100 GPUs with opt-level O2.
On PyTorch 1.4.0, it worked correctly on a P40 GPU with opt-level O2 (and presumably also with O0).
On PyTorch 1.4.0, it worked correctly on a V100 GPU with opt-level O0.
So I compared every single module's output in my model between opt-levels O0 and O2 on CUDA 9.2, PyTorch 1.4.0, and a V100 GPU.
I found that the outputs of Conv3d with kernel_size (3, 1, 1) differ significantly depending on the opt-level.
So I wrote the code snippet below to reproduce the behavior.
It seems to happen when the channel size is large.
import torch
import torch.nn as nn
from apex import amp


def init_model(in_channel=8, out_channel=4, kernel_size=(3, 1, 1),
               padding=(1, 0, 0), opt_level='O0'):
    model = nn.Conv3d(in_channel, out_channel, kernel_size=kernel_size,
                      padding=padding, bias=False)
    model = model.cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level=opt_level,
                                      keep_batchnorm_fp32=None if opt_level == 'O1' else True,
                                      loss_scale=None)
    return model, optimizer


def compare_opt_level(in_channel, out_channel, kernel_size, padding,
                      opt1='O0', opt2='O2'):
    input = torch.randn(size=(4, in_channel, 8, 16, 16)).cuda()
    n0, _ = init_model(in_channel, out_channel, kernel_size=kernel_size,
                       padding=padding, opt_level=opt1)
    n2, _ = init_model(in_channel, out_channel, kernel_size=kernel_size,
                       padding=padding, opt_level=opt2)
    # give both models identical weights (n2 gets the fp16-cast copy)
    init_param = torch.randn_like(n0.weight)
    # init_param = nn.init.kaiming_normal_(n0.weight, mode='fan_out')
    n0.weight = nn.Parameter(init_param)
    n2.weight = nn.Parameter(init_param.half())
    v0 = n0(input)
    v2 = n2(input)
    print('/////////////////////////////')
    print('Compare {}/{}, in/out channel {}/{}, kernel size {}'.format(
        opt1, opt2, in_channel, out_channel, kernel_size))
    # per-sample mean absolute difference (sqrt of the element-wise squared error)
    print(torch.sqrt((v0 - v2) ** 2).mean(dim=(1, 2, 3, 4)))


if __name__ == '__main__':
    torch.random.manual_seed(1)
    ic, oc = 8, 4
    compare_opt_level(ic, oc, kernel_size=(3, 1, 1), padding=(1, 0, 0))
    compare_opt_level(ic, oc, kernel_size=(3, 3, 3), padding=(1, 1, 1))
    compare_opt_level(ic, oc, kernel_size=(1, 1, 1), padding=(0, 0, 0))
    compare_opt_level(ic, oc, kernel_size=(1, 3, 3), padding=(0, 1, 1))
    ic, oc = 512, 128
    compare_opt_level(ic, oc, kernel_size=(3, 1, 1), padding=(1, 0, 0), opt1='O0', opt2='O2')
    compare_opt_level(ic, oc, kernel_size=(3, 3, 3), padding=(1, 1, 1), opt1='O0', opt2='O2')
    compare_opt_level(ic, oc, kernel_size=(1, 1, 1), padding=(0, 0, 0), opt1='O0', opt2='O2')
    compare_opt_level(ic, oc, kernel_size=(1, 3, 3), padding=(0, 1, 1), opt1='O0', opt2='O2')
    compare_opt_level(ic, oc, kernel_size=(3, 1, 1), padding=(1, 0, 0), opt1='O0', opt2='O1')
    compare_opt_level(ic, oc, kernel_size=(3, 3, 3), padding=(1, 1, 1), opt1='O0', opt2='O1')
    compare_opt_level(ic, oc, kernel_size=(1, 1, 1), padding=(0, 0, 0), opt1='O0', opt2='O1')
    compare_opt_level(ic, oc, kernel_size=(1, 3, 3), padding=(0, 1, 1), opt1='O0', opt2='O1')
The printed errors (strictly speaking, per-sample mean absolute differences, since torch.sqrt((v0 - v2) ** 2) is the element-wise absolute difference) between O0 and O2 on the V100 are:
Compare O0/O2, in/out channel 8/4, kernel size (3, 1, 1)
tensor([0.0016, 0.0017, 0.0016, 0.0016], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 8/4, kernel size (3, 3, 3)
tensor([0.0042, 0.0043, 0.0042, 0.0042], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 8/4, kernel size (1, 1, 1)
tensor([0.0007, 0.0007, 0.0007, 0.0007], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 8/4, kernel size (1, 3, 3)
tensor([0.0021, 0.0021, 0.0022, 0.0021], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 512/128, kernel size (3, 1, 1)
tensor([34.6549, 34.6763, 34.7861, 34.7793], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 512/128, kernel size (3, 3, 3)
tensor([0.0303, 0.0303, 0.0303, 0.0303], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 512/128, kernel size (1, 1, 1)
tensor([0.0064, 0.0064, 0.0064, 0.0064], device='cuda:0',
grad_fn=<MeanBackward1>)
Compare O0/O2, in/out channel 512/128, kernel size (1, 3, 3)
tensor([0.0183, 0.0183, 0.0183, 0.0183], device='cuda:0',
grad_fn=<MeanBackward1>)
You can see that the error is exceptionally large when the kernel size is (3, 1, 1) with large channel counts.
Unlike on the V100, the error is not nearly as large on the P40:
Compare O0/O2, in/out channel 8/4, kernel size (3, 1, 1)
tensor([0.0014, 0.0014, 0.0014, 0.0014], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 8/4, kernel size (3, 3, 3)
tensor([0.0038, 0.0037, 0.0037, 0.0037], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 8/4, kernel size (1, 1, 1)
tensor([0.0009, 0.0009, 0.0009, 0.0009], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 8/4, kernel size (1, 3, 3)
tensor([0.0023, 0.0023, 0.0023, 0.0023], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 512/128, kernel size (3, 1, 1)
tensor([0.0106, 0.0106, 0.0106, 0.0106], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 512/128, kernel size (3, 3, 3)
tensor([0.0305, 0.0303, 0.0303, 0.0304], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 512/128, kernel size (1, 1, 1)
tensor([0.0064, 0.0064, 0.0064, 0.0064], device='cuda:0',
grad_fn=<MeanBackward2>)
Compare O0/O2, in/out channel 512/128, kernel size (1, 3, 3)
tensor([0.0184, 0.0184, 0.0183, 0.0183], device='cuda:0',
grad_fn=<MeanBackward2>)
When I compare O0 and O1, the errors are all zero.
However, training also did not work properly with O1 on the V100, so there may be some other issue as well.
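As a sanity check on these numbers, one can estimate how much error fp16 precision *alone* should account for, independent of Apex and of whichever cuDNN kernel runs: round the weights and inputs to the fp16 grid but keep the convolution itself in fp32. This is just a diagnostic sketch I would suggest (the function name, default init, and smaller input shape are my own, not part of the experiment above); if the rounding-only gap stays tiny while the O2 gap is ~34, the excess error must come from the fp16 kernel itself.

```python
import torch
import torch.nn as nn


def fp16_rounding_gap(in_ch, out_ch, kernel_size, padding):
    """Run the same Conv3d twice in fp32: once with exact weights/inputs,
    once with weights/inputs rounded to fp16 precision first.  The gap
    shows how much error fp16 storage alone should cause, independent
    of which GPU kernel executes the convolution."""
    torch.manual_seed(1)
    conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding, bias=False)
    x = torch.randn(2, in_ch, 4, 8, 8)  # smaller shape, CPU-friendly
    with torch.no_grad():
        ref = conv(x)                                  # exact fp32 reference
        conv.weight.copy_(conv.weight.half().float())  # round weights to fp16 grid
        rounded = conv(x.half().float())               # round inputs to fp16 grid
    return (ref - rounded).abs().mean().item()


if __name__ == '__main__':
    for ks, pad in [((3, 1, 1), (1, 0, 0)), ((3, 3, 3), (1, 1, 1))]:
        print(ks, fp16_rounding_gap(512, 128, ks, pad))
```

With 512 input channels the rounding-only gap comes out orders of magnitude below the 34.65 measured under O2 on the V100, which points at the kernel rather than the precision.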
I recently found that there is torch.cuda.amp in the nightly builds, but I haven't tested it yet.
I cannot move to torch.cuda.amp because it seems that an O2-equivalent level is not provided yet.
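For reference, a minimal torch.cuda.amp training step would look roughly like the sketch below (my own example, not tested on the model from this post). autocast keeps the master weights in fp32 and casts individual ops per a type-promotion policy, so it behaves more like O1 than O2, which is why there is no direct O2 replacement.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, scaler, x, target, use_amp=True):
    """One training step with torch.cuda.amp (autocast + GradScaler)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        out = model(x)                       # ops run in fp16 where safe
        loss = nn.functional.mse_loss(out, target)
    scaler.scale(loss).backward()            # scaled loss, like amp.scale_loss
    scaler.step(optimizer)                   # unscales grads, skips on inf/nan
    scaler.update()                          # adjusts the loss scale
    return loss.item()


if __name__ == '__main__':
    use_amp = torch.cuda.is_available()      # falls back to plain fp32 on CPU
    device = 'cuda' if use_amp else 'cpu'
    model = nn.Conv3d(8, 4, (3, 1, 1), padding=(1, 0, 0), bias=False).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    x = torch.randn(4, 8, 4, 8, 8, device=device)
    tgt = torch.randn(4, 4, 4, 8, 8, device=device)
    print(train_step(model, opt, scaler, x, tgt, use_amp=use_amp))
```

If the (3, 1, 1) problem really is in the cuDNN fp16 kernel, this path may hit it too, so it would be worth running the same comparison under autocast before migrating.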