CUDA out of memory in my code

Hi, all,

I implemented an autoencoder with dense blocks in PyTorch. The total number of parameters is around 0.8M, the image size is 100x100, and the batch size for training is 100. However, when I run it on CUDA, a "CUDA out of memory" error occurs. Can anyone help me? My GPU card:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 745     Off  | 00000000:01:00.0  On |                  N/A |
| 25%   59C    P0    N/A /  N/A |   4037MiB /  4043MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

Thanks.

Hi,

There is not much you can do to reduce GPU memory usage except reducing the batch size and using inplace operations for activations like ReLU where possible (if you see an error saying that a tensor needed for backward has been modified by an inplace operation, it means you cannot use them at that particular place).
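For example, here is a minimal sketch of both ideas together (the layer sizes are illustrative only, not taken from your model):

import torch
import torch.nn as nn

# ReLU(inplace=True) overwrites its input instead of allocating a new tensor,
# which saves some activation memory during the forward pass.
block = nn.Sequential(
    nn.Conv2d(1, 48, kernel_size=3, padding=1),
    nn.BatchNorm2d(48),
    nn.ReLU(inplace=True),
)

# Reducing the batch size (e.g. from 100 to 25) shrinks activation memory
# roughly linearly.
x = torch.randn(25, 1, 100, 100)
out = block(x)
out.mean().backward()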

Hi,

Thanks for your reply. My GPU card only has around 4 GB of memory. Will this error be resolved if I use a GPU card with more memory?

Best regards

This might be the case. If it’s possible to post the model definition and some dummy inputs (and targets) I could check the memory usage on larger GPUs.

Hi, ptrblck,

Thanks for your kind help. The code is available in the following GitHub repository:
   https://github.com/xclmj/Dense-Convolutional-Encoder-Decoder-Network-with-Dense-Block.git

You can download the data from: https://drive.google.com/drive/folders/1VkYtS2oe-vwapUjwIG_0GdFzR_RMw2xW?usp=sharing

Best regards

Thanks for the code.
I’ve used the default arguments and this script to run the test:

import torch
import torch.nn as nn
# DenseEDT is defined in the repository linked above

device = 'cuda'

# enters time in latent
model = DenseEDT(1,
                 2,
                 blocks=(4, 9, 4),
                 times=(5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100),
                 growth_rate=24,
                 drop_rate=0,
                 bn_size=100,
                 num_init_features=48,
                 bottleneck=False,
                 time_channels=1).to(device)


optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             weight_decay=5e-4)
criterion = nn.MSELoss()

# dummy inputs and targets
data = torch.randn(10, 1, 224, 224, device=device)
times = torch.randn(10, device=device)

output = model(data, times)
target = torch.empty_like(output).normal_()

loss = criterion(output, target)
loss.backward()

print('mem allocated {:.3f}MB'.format(torch.cuda.memory_allocated()/1024**2))
> mem allocated 16.433MB
print('max mem allocated {:.3f}MB'.format(torch.cuda.max_memory_allocated()/1024**2))
> max mem allocated 2146.784MB
print('mem cached {:.3f}MB'.format(torch.cuda.memory_cached() / 1024**2))
> mem cached 2440.000MB

nvidia-smi shows a usage of approx. 3500MB using a Titan V GPU with 12GB memory.
The additional usage comes from the CUDA context, which will take some space on the GPU.
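To see whether your 4 GB card can fit the original setup, you could repeat the same measurement with the shapes from your first post. A sketch, assuming the same DenseEDT configuration accepts 100x100 inputs (the network's downsampling factors need to divide the spatial size, otherwise pad or resize):

# Hypothetical re-run with the setup from the first post:
# batch size 100 and 100x100 images.
torch.cuda.reset_max_memory_allocated()

data = torch.randn(100, 1, 100, 100, device=device)
times = torch.randn(100, device=device)

output = model(data, times)
loss = criterion(output, torch.empty_like(output).normal_())
loss.backward()

# If this peak plus a few hundred MB for the CUDA context exceeds ~4GB,
# the batch size has to be reduced.
print('max mem allocated {:.3f}MB'.format(
    torch.cuda.max_memory_allocated() / 1024**2))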