CUDA OOM when training FCN8s on full-size VOC data

Hi!
I am trying to train an FCN8s net on full-size VOC2012 data with batch size 1, using PyTorch 1.0.0a0 on a 32 GB GPU. Note that the original model was trained on a 12 GB card. Here is the optimisation loop:

    for epoch in range(1, NUM_EPOCHS + 1):
        for img, lbl in tqdm.tqdm(train_loader):
            print(ssms.util.memuse())   # my own helper, prints CUDA memory stats
            i += 1                      # global iteration counter, initialised before the loop

            img = img.cuda()
            lbl = lbl.cuda().long().squeeze_()
            if lbl.dim() < 3:           # batch size 1: squeeze_ also dropped the batch dim,
                lbl.unsqueeze_(0)       # so put it back

            optimizer.zero_grad()
            out = model(img)
            loss = loss_fn(input=out, target=lbl)
            loss.backward()
            optimizer.step()

I print the memory profile, and after the first batch the maximum allocated memory is already 18 GB (more than the 12 GB the original model was trained with!); on the second iteration the script crashes with an OOM error. The same model trains successfully on 256px-downsampled data with batch size 20. Please help me understand what is going on.
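
For completeness, ssms.util.memuse() is my own small helper; it reports roughly the equivalent of the sketch below, so the 18 GB figure above corresponds to torch.cuda.max_memory_allocated():

    import torch

    def memuse():
        # current and peak CUDA memory allocated by tensors, in GB
        alloc = torch.cuda.memory_allocated() / 1024 ** 3
        peak = torch.cuda.max_memory_allocated() / 1024 ** 3
        return f'allocated: {alloc:.1f} GB, max allocated: {peak:.1f} GB'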

Adding something like del loss, out, img, lbl; gc.collect() after optimizer.step() didn’t help…
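
Concretely, the variant of the loop for that attempt looked roughly like this:

    import gc

    for epoch in range(1, NUM_EPOCHS + 1):
        for img, lbl in tqdm.tqdm(train_loader):
            img = img.cuda()
            lbl = lbl.cuda().long().squeeze_()
            if lbl.dim() < 3:
                lbl.unsqueeze_(0)

            optimizer.zero_grad()
            out = model(img)
            loss = loss_fn(input=out, target=lbl)
            loss.backward()
            optimizer.step()

            # explicitly drop references to the loss, the output and the
            # inputs, then force Python garbage collection
            del loss, out, img, lbl
            gc.collect()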