I’m trying to fine-tune ResNet18 with its BatchNorm layers replaced by GroupNorm. It works on CPU, it works when the GroupNorm layers stay frozen, and it works with the original BatchNorm… but with GroupNorm unfrozen it always fails with `RuntimeError: CUDA error: an illegal memory access was encountered`.
Here is minimal code that reproduces the issue:
## Repro CUDA issue

```python
import torch
from torchvision.models import resnet18

m = resnet18(pretrained=True)

# Recursively swap every BatchNorm2d for a one-group GroupNorm
def replace_bn(m):
    for name, child in m.named_children():
        if type(child) == torch.nn.BatchNorm2d:
            setattr(m, name, torch.nn.GroupNorm(num_groups=1, num_channels=child.num_features))
        replace_bn(child)

replace_bn(m)

# Freeze the model, except the norm layers
for p in m.parameters():
    p.requires_grad = False

def unfreeze_norm_layers(m):
    if type(m) == torch.nn.modules.batchnorm.BatchNorm2d or type(m) == torch.nn.GroupNorm:
        for p in m.parameters():
            p.requires_grad = True

m.apply(unfreeze_norm_layers)

device = torch.device('cuda')
m.to(device)
inp = torch.randn(64, 3, 218, 178).to(device)
labels = torch.randint(0, 1000, (64,)).to(device)

opt = torch.optim.SGD(m.parameters(), lr=0.01, momentum=0.9)
opt.zero_grad()
l = torch.nn.CrossEntropyLoss()(m(inp), labels)
l.backward()
l.item()
# -> RuntimeError: CUDA error: an illegal memory access was encountered
```

Notes:
- It works if you don't replace the BatchNorm (i.e. you can unfreeze BatchNorms).
- It works if you don't unfreeze the GroupNorm.
- It works on CPU.
- torch version: 1.6.0+cu92
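For what it's worth, the replacement helper itself seems sound. Here is a small CPU-only sanity check with the same recursive swap on a toy model (the model layout below is just illustrative, not taken from ResNet):

```python
import torch

# Toy model with BatchNorm2d at two nesting levels, to exercise the recursion
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.BatchNorm2d(8),
    torch.nn.Sequential(
        torch.nn.Conv2d(8, 8, 3, padding=1),
        torch.nn.BatchNorm2d(8),
    ),
)

def replace_bn(m):
    # Same logic as in the repro: swap each BatchNorm2d for GroupNorm, recurse
    for name, child in m.named_children():
        if type(child) == torch.nn.BatchNorm2d:
            setattr(m, name, torch.nn.GroupNorm(num_groups=1, num_channels=child.num_features))
        replace_bn(child)

replace_bn(model)

# No BatchNorm2d left at any depth, and gradients flow through GroupNorm on CPU
assert not any(isinstance(c, torch.nn.BatchNorm2d) for c in model.modules())
out = model(torch.randn(2, 3, 16, 16))
out.sum().backward()
print(out.shape)  # -> torch.Size([2, 8, 16, 16])
```

So the swap and CPU backward both behave; the failure only shows up on CUDA.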
Any ideas what I’m doing wrong?
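In case it helps with debugging: as I understand it, CUDA errors are reported asynchronously, so the traceback may not point at the op that actually faulted. A sketch of forcing synchronous kernel launches (the environment variable is the standard CUDA/PyTorch one; it must be set before CUDA is initialized):

```python
import os

# Must be set before torch initializes CUDA, so kernel launches run
# synchronously and the traceback points at the kernel that faulted.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# ...then import torch and run the repro as above.
```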