I recently ran into a problem with CUDA memory leakage: all 8 GPUs ran out of their 12GB of memory after a certain number of training steps, and I noticed that GPU memory usage was building up gradually. Usage increased by roughly 40MB per GPU every step, even after releasing cached memory back to the OS with torch.cuda.empty_cache(). Since my training code is fairly simple, I suspect something fishy is going on inside the encapsulated modules. Has anyone encountered a similar problem before?
The following is the main body of the training code:
```python
for epoch in range(nepoch):
    for im, rois_h, rois_o, scores, ip, labels in dataloader:
        # relocate tensors to cuda
        im = torch.cat([im] * nGPUs, 0).to(device)
        rois_h = rois_h.to(device)
        rois_o = rois_o.to(device)
        scores = scores.to(device)
        ip = ip.to(device)
        labels = labels.to(device)
        # zero the parameter gradients
        optimizer.zero_grad()
        # perform forward pass
        out = net(im, rois_h, rois_o, scores, ip)
        # compute loss
        loss = criterion(out, labels.float())
        # perform back propagation
        loss.backward()
        optimizer.step()
        # clean up cache
        torch.cuda.empty_cache()
```
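In case it's useful, here is a minimal sketch of how I could log the per-step growth from inside the loop (assuming torch.cuda.memory_allocated and torch.cuda.memory_reserved are the right counters to watch; the log_gpu_memory helper is just for illustration):

```python
import torch

def log_gpu_memory(step):
    """Print allocated and reserved CUDA memory (in MB) for every visible GPU."""
    for d in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(d) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(d) / 1024 ** 2
        print(f"step {step} / cuda:{d}: "
              f"allocated {allocated:.1f} MB, reserved {reserved:.1f} MB")
```

Calling this once per iteration (e.g. right after optimizer.step()) would show whether the ~40MB growth comes from tensors PyTorch still considers allocated, or only from memory held by the caching allocator.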