PyTorch vs Theano GPU RAM usage issue

Hello,

I’ve noticed that PyTorch (torch 0.1.10.post2, torchvision 0.1.7) uses significantly more GPU RAM than e.g. Theano running similar code. Unfortunately I cannot disclose the actual code, but I think it should be possible to reproduce this behavior with the following simple sequential architecture (all dimensions are of the form MINIBATCH_LENGTH x NUM_CHANNELS x H x W; a code sketch follows the layer list):

N - minibatch size, e.g. 20, padding mode == “same” everywhere

Input: Nx10x1024x1024
Layer0: Conv2D, stride=2, filters=32, output: Nx32x512x512
Layer1: Conv2D, stride=2, filters=64, output: Nx64x256x256
Layer2: Conv2D, stride=2, filters=128, output: Nx128x128x128
… all the way down to 16x16 feature maps
LayerX: Conv2D, stride=2, filters=…, output: Nx…x16x16
… now go the opposite way with transposed convolutions
LayerX+1: ConvTranspose2D, stride=2, filters=…, output: Nx…x32x32
… all the way up …
Layer(Last): ConvTranspose2D, stride=2, filters=…, output: Nx10x1024x1024
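
For reference, here is a minimal sketch of such an encoder/decoder model. The exact channel widths, kernel sizes and layer count are my own assumptions (the original code is not disclosed); kernel_size=4 with stride=2 and padding=1 gives the “same”-style halving/doubling of the spatial dimensions described above.

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1024 -> 16 needs six stride-2 convolutions; channel widths are illustrative
        channels = [10, 32, 64, 128, 256, 256, 256]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1))
            layers.append(nn.ReLU(inplace=True))
        # the decoder mirrors the encoder with transposed convolutions, 16 -> 1024
        reversed_channels = channels[::-1]
        for c_in, c_out in zip(reversed_channels[:-1], reversed_channels[1:]):
            layers.append(nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1))
            layers.append(nn.ReLU(inplace=True))
        layers.pop()  # no ReLU after the final Nx10x1024x1024 output
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)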

Main loop:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable  # old 0.1.x-era API, matching the original snippet

net = Net()
net.cuda()
net.train()

optimizer = optim.SGD(net.parameters(), lr = ...)
criterion = nn.MSELoss()

input = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())
target = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())

for epoch in xrange(...):

    output = net(input)
    loss = criterion(output, target)
    
    net.zero_grad()
    loss.backward()
    optimizer.step()

On a GTX 1060, the corresponding code uses around 30% more GPU RAM than its Theano counterpart (execution times are about the same for the Theano and PyTorch versions).

Is this something that can be fixed?

Thanks,

PyTorch uses a caching memory allocator, which means it may hold on to and cache memory blocks even when it is not actively using them. In that case, nvidia-smi won’t report the exact amount of memory PyTorch is using. In out-of-memory situations, PyTorch automatically frees cached blocks.

This might be one reason to explain the memory difference.
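
As a side note for anyone measuring this: newer PyTorch releases (well after the 0.1.10 build discussed here) expose helpers for the caching allocator, which make it easier to tell how much of the nvidia-smi figure is live tensors versus cached blocks. A rough sketch, assuming a recent PyTorch version:

import torch

print(torch.cuda.memory_allocated())  # bytes currently held by live tensors
print(torch.cuda.memory_reserved())   # bytes held (cached) by the allocator
torch.cuda.empty_cache()              # release unused cached blocks back to the driver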

Thanks for your reply. This does indeed seem to be the case (the maximum mini-batch size is quite similar for both Theano and PyTorch).

However, is there a way to constrain memory allocation? Theano has lib.cnmem, which can be given a percentage of GPU memory to pre-allocate. It’s a soft limit, but still helpful when sharing a GPU between multiple jobs.

There is currently no way to constrain PyTorch’s memory usage to an upper bound.
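
A later note for readers: more recent PyTorch versions added torch.cuda.set_per_process_memory_fraction, which imposes an allocator-level cap somewhat like lib.cnmem’s limit; it did not exist in the 0.1.x releases discussed in this thread. A hedged sketch, assuming a recent PyTorch:

import torch

# Only available in much newer PyTorch releases than the one in this thread.
# Caps this process at roughly 50% of GPU 0's memory; allocations beyond the
# cap raise an out-of-memory error instead of growing further.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)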