I’ve noticed that PyTorch (torch 0.1.10.post2, torchvision 0.1.7) uses significantly more GPU RAM than, e.g., Theano running similar code. Unfortunately I cannot disclose the actual code, but I think it should be possible to reproduce this behavior with the following simple sequential architecture (all dimensions are in the form MINIBATCH_LENGTH x NUM_CHANNELS x H x W):

N is the minibatch size (e.g. 20); padding mode == “same” everywhere

Input: Nx10x1024x1024

Layer0: Conv2D, stride=2, filters=32, output: Nx32x512x512

Layer1: Conv2D, stride=2, filters=64, output: Nx64x256x256

Layer2: Conv2D, stride=2, filters=128, output: Nx128x128x128

… all the way down to 16x16 feature maps

LayerX: Conv2D, stride=2, filters=…, output: Nx…x16x16

… now go the opposite way, using transposed convolutions

LayerX+1: ConvTranspose2D, stride=2, filters=…, output: Nx…x32x32

… all the way up …

Layer(Last): ConvTranspose2D, stride=2, filters=…, output: Nx10x1024x1024
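
To give a feel for the sizes involved, here is a rough sketch that computes the memory taken by the activation tensors whose shapes are listed above. It assumes float32 activations and N = 20; the helper names (`activation_mib`, `same_padding_output_size`) are made up for illustration, and the layers with elided filter counts are left out rather than guessed.

```python
# Rough activation-size estimate for the layers whose shapes are given above.
# Assumptions (not stated in the post): float32 activations; N = 20 is the
# example minibatch size. Helper names are hypothetical.
BYTES_PER_FLOAT32 = 4


def activation_mib(n, channels, height, width, bytes_per_elem=BYTES_PER_FLOAT32):
    """Return the size of one N x C x H x W activation tensor in MiB."""
    return n * channels * height * width * bytes_per_elem / (1024.0 ** 2)


def same_padding_output_size(size, stride):
    """Spatial output size of a 'same'-padded convolution: ceil(size / stride)."""
    return (size + stride - 1) // stride


N = 20
# Only layers with fully specified shapes; the elided ones are omitted.
shapes = [
    ("Input", 10, 1024, 1024),
    ("Layer0", 32, 512, 512),
    ("Layer1", 64, 256, 256),
    ("Layer2", 128, 128, 128),
    ("Layer(Last)", 10, 1024, 1024),
]

sizes = {name: activation_mib(N, c, h, w) for name, c, h, w in shapes}
total_mib = sum(sizes.values())

for name, c, h, w in shapes:
    print("%-12s %dx%dx%dx%d  %.0f MiB" % (name, N, c, h, w, sizes[name]))
print("Total for listed tensors: %.0f MiB" % total_mib)
```

Under these assumptions the listed tensors alone add up to roughly 2.7 GiB, before counting gradients, the omitted layers, or any workspace the framework allocates.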

Main loop:

```
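import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

net = Net()  # the sequential architecture described above
net.cuda()
net.train()
optimizer = optim.SGD(net.parameters(), lr = ...)
criterion = nn.MSELoss()
# Input and target are moved to the GPU once and reused every epoch.
input = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())
target = Variable(torch.from_numpy(<your-Nx10x1024x1024-tensor>).cuda())
for epoch in xrange(...):
    output = net(input)
    loss = criterion(output, target)
    net.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()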
```

On a GTX-1060, the corresponding code takes around 30% more GPU RAM than its Theano counterpart (execution times are about the same for the Theano and PyTorch versions).

Is this something that can be fixed?

