Writing GPU-memory-efficient code

I trained an autoencoder and now want to embed my images as the average of 100 subsampled embeddings (each randomly cropped from a much larger image). But when I try to embed these images, I run out of GPU memory…

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu:58

…but only after hitting ~80 samples.

Given that (1) I was able to train my model on this data, and (2) I am able to load the data and perform several dozen forward passes, I suspect the issue is that my program is holding on to memory rather than letting it go.
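
One thing I plan to do to confirm this is log the allocated GPU memory after each sample (a minimal sketch; log_gpu_memory is just an illustrative helper, and torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated() should exist in the 0.4 build my traceback comes from):

import torch

def log_gpu_memory(step):
    # Bytes currently held by tensors on the default CUDA device.
    allocated = torch.cuda.memory_allocated()
    # Peak allocation seen so far in this process.
    peak = torch.cuda.max_memory_allocated()
    print('step %d: allocated %.1f MB (peak %.1f MB)'
          % (step, allocated / 1e6, peak / 1e6))

If the allocated number keeps climbing with every sample instead of plateauing after the first forward pass, that would confirm that memory is being retained somewhere.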

I followed Soumith’s recommendations to wrap code in functions, so that variables can be garbage collected, and to explicitly call gc.collect(). My batch size is already just 1. Is there anything else I can change to make this program use less memory?
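
In isolation, the pattern I followed looks roughly like this (a simplified sketch; process_one and process_all are illustrative names, not my actual functions, which are below):

import gc

def process_one(x, model):
    # All intermediates live only inside this function, so they become
    # unreachable (and collectible) once it returns.
    return model(x)

def process_all(data_loader, model):
    outputs = []
    for x in data_loader:
        outputs.append(process_one(x, model))
        # Explicitly collect whatever the previous iteration left behind.
        gc.collect()
    return outputs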

Here are the two main functions:

def preembed_images(subdir, model):
    model.eval()
    model = cuda.ize(model)

    dataset = GTExImages()
    indices = list(range(len(dataset)))
    data_loader = DataLoader(dataset=dataset,
                             batch_size=1,
                             sampler=SequentialSampler(indices),
                             num_workers=4,
                             pin_memory=use_cuda)

    # One row per image: the mean of that image's crop embeddings.
    Z = torch.Tensor(N_SAMPLES, D_EMBEDDINGS)

    for i, x in enumerate(data_loader):
        print('Embedded %s-th image.' % i)
        Z[i] = embed_one_image(x, model, dataset.subsample, cfg)
        gc.collect()

    torch.save(Z, '%s/embedded_images.pt' % subdir)

# ------------------------------------------------------------------------------

def embed_one_image(x, model, subsample, cfg):
    Z = torch.Tensor(N_Z_PER_SAMPLE, D_EMBEDDINGS)
    for i in range(N_Z_PER_SAMPLE):
        # Take one random crop of the image and embed it.
        xi = subsample(x.squeeze(0)).unsqueeze(0)
        zi = model(cuda.ize(xi))
        Z[i] = zi
    # Average the per-crop embeddings into a single vector.
    return Z.mean(dim=0)

Maybe this isn’t actually a problem caused by running out of memory. Try reinstalling your PyTorch and see if that helps!