OOM error as soon as the first batch is run through the network

I am trying to train a model on the Cityscapes dataset for segmentation. I am using torchvision's deeplabv3_resnet50 model together with its Cityscapes dataset class and transforms. In case it matters, I am running the code in a Jupyter notebook.

The datasets are working, as are the dataloaders. When I attempt to train, I always get this error at the point where the first batch is put through the network (y_ = net(xb) in the one_epoch function):

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 6.00 GiB total capacity; 4.20 GiB already allocated; 6.87 MiB free; 4.20 GiB reserved in total by PyTorch)

What is strange is that no matter what the batch size (bs) is, the amount of memory reported free in the error is always a little less than the amount it is trying to allocate; e.g. for bs=16 I get:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 2.90 GiB already allocated; 1.70 GiB free; 2.92 GiB reserved in total by PyTorch)
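For context on those numbers, here is a minimal sketch (assuming a CUDA device is visible to PyTorch) of how to read the counters that appear in the error message; "allocated" is what live tensors occupy, while "reserved" is what PyTorch's caching allocator has claimed from the driver:

import torch

def report_gpu_memory(tag=''):
    # memory_allocated(): bytes occupied by live tensors
    # memory_reserved(): bytes held by the caching allocator, i.e. the
    # "reserved in total by PyTorch" figure from the error message
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f'{tag} allocated: {alloc:.2f} GiB  reserved: {reserved:.2f} GiB')

report_gpu_memory('before forward pass:')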

I have a much more complicated model that runs fine with bs=16, but it builds everything from scratch. I would really like to take advantage of the simplicity torchvision offers with its model zoo and datasets.

My code is below; not much more than the bare essentials, just enough to show whether it runs OK on the GPU. Can anyone help, please?

import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from tqdm import tqdm, trange

def one_epoch(net, loss, dl, opt=None, metric=None):
    if opt:
        net.train()  # only affects some layers
    else:
        net.eval()
        rq_stored = []
        for p in net.parameters():
            rq_stored.append(p.requires_grad)
            p.requires_grad = False  # no grads needed during evaluation

    L, M = [], []
    for xb, yb in tqdm(dl, leave=False):
        xb, yb = xb.cuda(), yb.cuda()
        y_ = net(xb)  # <-- this is where the OOM is raised
        l = loss(y_, yb)
        if opt:
            opt.zero_grad()
            l.backward()
            opt.step()
        L.append(l.detach().cpu().numpy())
        if metric: M.append(metric(y_, yb).cpu().numpy())

    if not opt:  # restore the requires_grad flags after evaluation
        for p, rq in zip(net.parameters(), rq_stored): p.requires_grad = rq

    return L, M

accuracy = lambda y_,yb: (y_.max(dim=1)[1] == yb).float().mean()

def fit(net, tr_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=3, lr=3e-3, wd=1e-3):
    opt = optim.Adam(net.parameters(), lr=lr, weight_decay=wd)

    Ltr_hist, Lval_hist = [], []
    for epoch in trange(epochs):
        Ltr,  _    = one_epoch(net, loss, tr_dl,  opt)
        Lval, Aval = one_epoch(net, loss, val_dl, None, accuracy)
        Ltr_hist.append(np.mean(Ltr))
        Lval_hist.append(np.mean(Lval))
        print(f'epoch: {epoch+1}\ttraining loss: {np.mean(Ltr):0.4f}\tvalidation loss: {np.mean(Lval):0.4f}\tvalidation accuracy: {np.mean(Aval):0.2f}')

    return Ltr_hist, Lval_hist

class To3ch(object):
    def __call__(self, pic):
        # replicate single-channel images to 3 channels so the ImageNet
        # normalization stats can be applied
        if pic.shape[0] == 1: pic = pic.repeat(3, 1, 1)
        return pic

bs = 1
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

transf = transforms.Compose([
    transforms.ToTensor(),
    To3ch(),
    transforms.Normalize(*imagenet_stats)
])

train_ds = datasets.Cityscapes('C:/cityscapes_ds', split='train', target_type='semantic', transform=transf, target_transform=transf)
val_ds = datasets.Cityscapes('C:/cityscapes_ds', split='val', target_type='semantic', transform=transf, target_transform=transf)

train_dl  = DataLoader(train_ds,  batch_size=bs,   shuffle=True,  num_workers=0)
val_dl = DataLoader(val_ds, batch_size=2*bs, shuffle=False, num_workers=0)

net = models.segmentation.deeplabv3_resnet50(num_classes=20)
fit(net.cuda(), train_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=1, lr=1e-4, wd=1e-4)

I was forgetting to crop the massive Cityscapes images (they are 2048x1024), hence the OOM.
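For anyone hitting the same thing, here is a minimal sketch of one way to do the crop. The JointRandomCrop helper and the 768x768 size are illustrative choices, not part of the original code; the important detail is to pass a single callable via the dataset's joint transforms argument so the image and mask receive identical crop coordinates (the ToTensor/Normalize/To3ch steps above would then be folded into the same callable, since Cityscapes does not accept transforms together with transform/target_transform):

import torchvision.transforms.functional as TF
from torchvision import datasets, transforms

class JointRandomCrop(object):
    # illustrative helper: applies ONE set of random crop coordinates to
    # both the image and its segmentation mask
    def __init__(self, size=(768, 768)):
        self.size = size
    def __call__(self, img, target):
        i, j, h, w = transforms.RandomCrop.get_params(img, self.size)
        return TF.crop(img, i, j, h, w), TF.crop(target, i, j, h, w)

train_ds = datasets.Cityscapes('C:/cityscapes_ds', split='train',
                               target_type='semantic',
                               transforms=JointRandomCrop())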