PyTorch vs Theano GPU RAM usage issue


I’ve noticed that PyTorch (torch 0.1.10.post2, torchvision 0.1.7) uses significantly more GPU RAM than e.g. Theano running similar code. Unfortunately I cannot disclose the actual code, but I think it should be possible to reproduce this behavior with the following simple sequential architecture (all dimensions are in the form MINIBATCH_LENGTH x NUM_CHANNELS x H x W):

N - minibatch size, e.g. 20, padding mode == “same” everywhere

Input: Nx10x1024x1024
Layer0: Conv2D, stride=2, filters=32, output: Nx32x512x512
Layer1: Conv2D, stride=2, filters=64, output: Nx64x256x256
Layer2: Conv2D, stride=2, filters=128, output: Nx128x128x128
… all the way down to 16x16 feature maps
LayerX: Conv2D, stride=2, filters=…, output: Nx…x16x16
… now, go opposite way - transposed convolution
LayerX+1: ConvTranspose2D, stride=2, filters=…, output: Nx…x32x32
… all the way up …
Layer(Last): ConvTranspose2D, stride=2, filters=…, output: Nx10x1024x1024
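
For a rough sense of scale, here is a back-of-the-envelope sketch of the activation sizes for the layers whose shapes are spelled out above. It assumes float32 storage (4 bytes per element) and N = 20, and counts only a single copy of each output; the helper name activation_mib is made up for illustration. Actual usage will be higher once gradients, workspace buffers and allocator caching are included.

```python
def activation_mib(n, c, h, w, bytes_per_elem=4):
    """Size in MiB of one N x C x H x W tensor (float32 assumed by default)."""
    return n * c * h * w * bytes_per_elem / float(1024 ** 2)

N = 20  # example minibatch size from the description above

# Only the layers with fully specified shapes are listed here.
shapes = [
    ("Input",  (N, 10, 1024, 1024)),
    ("Layer0", (N, 32, 512, 512)),
    ("Layer1", (N, 64, 256, 256)),
    ("Layer2", (N, 128, 128, 128)),
]

sizes = [(name, activation_mib(*shape)) for name, shape in shapes]
for name, mib in sizes:
    print("%-7s %6.0f MiB" % (name, mib))
```

Under these assumptions the four tensors come to 800, 640, 320 and 160 MiB respectively, so even small differences in how a framework caches or frees intermediate buffers can show up as a large gap in reported GPU RAM.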

Main loop:

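import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

# Net is the architecture described above; move it to the GPU to match the inputs
net = Net().cuda()

optimizer = optim.SGD(net.parameters(), lr = ...)
criterion = nn.MSELoss()

input = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())
target = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())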

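for epoch in xrange(...):

    optimizer.zero_grad()              # clear gradients from the previous iteration
    output = net(input)
    loss = criterion(output, target)
    loss.backward()                    # backpropagate
    optimizer.step()                   # SGD parameter update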

On a GTX 1060, the corresponding code takes around 30% more GPU RAM than its Theano counterpart (execution times are about the same for the Theano and PyTorch versions).

Is this something that can be fixed?


PyTorch uses a caching memory allocator, which means it may hold on to GPU memory that it is not currently using: freed blocks are kept cached for reuse instead of being returned to the driver. Because of this, nvidia-smi won't show you the exact amount of memory PyTorch is actually using. In out-of-memory situations, PyTorch automatically frees the cached blocks.
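
To illustrate the idea, here is a toy model of a caching allocator in plain Python. It is only a sketch of the general technique, not PyTorch's actual implementation; the class name ToyCachingAllocator and its methods are invented for this example, and sizes are arbitrary units.

```python
class ToyCachingAllocator(object):
    """Toy model of a caching allocator (illustrative only)."""

    def __init__(self):
        self.free_blocks = {}  # block size -> number of cached, unused blocks
        self.in_use = 0        # memory held by live tensors
        self.reserved = 0      # memory taken from the driver (what a tool like nvidia-smi would see)

    def malloc(self, size):
        if self.free_blocks.get(size, 0) > 0:
            self.free_blocks[size] -= 1   # reuse a cached block of the same size
        else:
            self.reserved += size         # otherwise request new memory from the driver
        self.in_use += size

    def free(self, size):
        # Keep the block cached instead of returning it to the driver.
        self.free_blocks[size] = self.free_blocks.get(size, 0) + 1
        self.in_use -= size

    def release_cached(self):
        # Return all cached blocks to the driver, e.g. when an allocation would otherwise fail.
        for size, count in self.free_blocks.items():
            self.reserved -= size * count
        self.free_blocks.clear()


alloc = ToyCachingAllocator()
alloc.malloc(640)
alloc.malloc(320)
alloc.free(640)
after_free = (alloc.in_use, alloc.reserved)      # (320, 960): the freed block is still reserved
alloc.malloc(640)
after_reuse = (alloc.in_use, alloc.reserved)     # (960, 960): the cached block was reused
alloc.free(640)
alloc.free(320)
alloc.release_cached()
after_release = (alloc.in_use, alloc.reserved)   # (0, 0): everything returned to the driver
```

The gap between in_use and reserved after the first free is the kind of difference an external tool would report as extra memory usage, even though that memory is available for reuse inside the process.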

This might be one reason for the memory difference.

Thanks for your reply. This does indeed seem to be the case (the maximum mini-batch size seems to be quite similar for both Theano and PyTorch).

However, is there a way to constrain memory allocation? In Theano there’s lib.cnmem, which can be set to the fraction of GPU memory to pre-allocate. It’s a soft limit, but still helpful when sharing a GPU between multiple jobs.
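
For reference, a minimal sketch of how that Theano setting can be applied from Python through the THEANO_FLAGS environment variable. The value 0.5 is just an example fraction, and this assumes Theano's older CUDA backend, where lib.cnmem applies.

```python
import os

# Must be set before `import theano`, since Theano reads THEANO_FLAGS at import time.
# 0.5 is an arbitrary example: pre-allocate roughly half of the GPU memory.
os.environ["THEANO_FLAGS"] = "lib.cnmem=0.5"

# import theano  # import only after the flag is set
```

The same flag can also go in the [lib] section of .theanorc as cnmem = 0.5.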

There is no way to constrain memory usage to an upper bound right now.