Hello,
I’m running my code on a Tesla P100 (16 GB). I get stuck at a CUDA out-of-memory error; the following is raised after 17 epochs of training:
line 2234, in forward
x = nn.AvgPool1d(90, stride=None)(x)
File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 499, in forward
self.count_include_pad)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)
The error comes from the forward function, at the line "x = nn.AvgPool1d(90, stride=None)(x)":
def forward(self, x):
    x = self.layer1(x)
    x = self.layer2(x)
    ...
    x = self.layer10(x)
    x = nn.AvgPool1d(90, stride=None)(x)  # here I get the error
    x = x.squeeze(2)
    x = self.fc1(x)
    return x
I solved that by manually deleting my variables after each batch iteration.
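For anyone wondering what that can look like, here is a minimal sketch. The model, optimizer and random batches are hypothetical stand-ins (not the code from this thread), and whether the explicit del actually helps depends on what else is still holding references to those tensors.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = nn.Linear(16, 1).to(device)            # hypothetical stand-in model
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()

losses = []
for _ in range(5):                           # stand-in for iterating a DataLoader
    train_x = torch.randn(32, 16, device=device)
    train_y = torch.randn(32, 1, device=device)

    optimizer.zero_grad()
    y = net(train_x)
    loss = criterion(y, train_y)
    loss.backward()
    optimizer.step()

    losses.append(loss.item())               # keep a Python float, not the tensor
    # drop the per-batch references so their memory can be reused
    del train_x, train_y, y, loss
```

The names are rebound on the next iteration anyway, so the del mostly matters for the last batch of an epoch and for tensors that would otherwise stay alive while evaluation runs.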
Another common solution is:
import sys
import torch

try:
    y = net(train_x)  # call the module rather than net.forward so hooks still run
except RuntimeError as e:
    if 'out of memory' in str(e):
        print('| WARNING: ran out of memory, retrying batch', file=sys.stdout)
        sys.stdout.flush()
        for p in net.parameters():
            if p.grad is not None:
                p.grad = None  # drop the gradient to free some memory
        torch.cuda.empty_cache()
        y = net(train_x)
    else:
        raise
Were you able to find out why this is happening? It is very strange, since I call optim.zero_grad() inside the training loop. Could it be that CUDA is not purging the data from the dataloader on every iteration, so that it accumulates?
Hi @ptrblck, thanks for the reply. I already have it wrapped in a train function, which I call from __main__. I presume this is what you meant? Still the same error. I tried everything, from changing batch_size to reducing the number of parameters, but it throws an error before an epoch completes, usually towards the end of the epoch, like below. I also tried calling torch.cuda.empty_cache() inside the training loop, with no success either.
Error Message : RuntimeError: CUDA out of memory. Tried to allocate 4.37 GiB (GPU 0; 11.17 GiB total capacity; 4.78 GiB already allocated; 1.58 GiB free; 4.49 GiB cached)
===========
current memory allocated: 4693.4131
Max memory allocated: 9319.9683
Cached memory: 9490.0000
===========
def train(args):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_dataloader, test_dataloader, embedding_tuple = get_loader(args)
    model = SmallEmbedNN(embedding_tuple, len(args.cont_cols), 1,
                         emb_dropout=0.1, lin_layer_dropouts=0.1).to(device)
    criterion_l1 = L1Loss()
    # sparse optimizer for embeddings and normal optimizer for rest
    embed_sparse_vec = 'emb_layer.weight'
    opt = torch.optim.Adam([weights for name, weights in model.state_dict().items()
                            if name != embed_sparse_vec], lr=args.lr)
    optSparse = torch.optim.SparseAdam([model.state_dict()[embed_sparse_vec]],
                                       lr=args.lr)
    train_loss_list = []
    test_loss_list = []
    try:
        for epoch in range(args.epochs):
            model.train()
            for _dict in tqdm(train_dataloader):
                # pass vars to cuda device
                _dict = {key: var.to(device) for key, var in _dict.items()}
                # zeroing parameter gradients
                opt.zero_grad()
                optSparse.zero_grad()
                # Forward Pass
                preds = model(_dict['cont'], _dict['cat'])
                loss_train_l1 = criterion_l1(preds, _dict['target'],
                                             yr_weights=_dict['wts'])  # L1 loss
                # Backward Pass and Optimization
                loss_train_l1.backward()
                opt.step()
                optSparse.step()
                # batch train loss
                train_loss = loss_train_l1.item()
                logger.info(f'Epoch:{epoch} train[batch] loss(L1):{train_loss:.4f}')
                # defrag cached memory
                torch.cuda.empty_cache()
            # append train loss
            train_loss_list.append(train_loss)
            # evaluate test-set and append test loss
            test_loss = eval_test(model, test_dataloader, criterion_l1, device)
            test_loss_list.append(test_loss)
            if epoch % 1 == 0:
                logger.info(f'Epoch:{epoch}\ttrain_loss:{train_loss_list[-1]:.6f} '
                            f'\ttest_loss:{test_loss_list[-1]:.6f}')
            # earlystopping if test loss is not improving in last 10 epochs
            if min(test_loss_list) < min(test_loss_list[-20:]):
                break
    except Exception as e:
        logger.exception(e)
        memory_log(logger)
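The memory_log helper called above is not shown in the thread. A hypothetical reconstruction that would print the three figures quoted earlier could look like the sketch below. It assumes the values are reported in MiB, and it uses torch.cuda.memory_reserved, which older releases exposed as torch.cuda.memory_cached.

```python
import logging

import torch


def memory_log(logger, device=None):
    """Hypothetical helper: log allocator statistics in MiB (assumed unit)."""
    if not torch.cuda.is_available():
        logger.info('CUDA is not available, nothing to report')
        return None
    mib = 1024 ** 2
    stats = {
        # memory currently occupied by live tensors
        'allocated': torch.cuda.memory_allocated(device) / mib,
        # peak of the value above since the start of the program
        'max_allocated': torch.cuda.max_memory_allocated(device) / mib,
        # memory held by the caching allocator (called "cached" in older releases)
        'reserved': torch.cuda.memory_reserved(device) / mib,
    }
    logger.info('===========')
    logger.info(f"current memory allocated: {stats['allocated']:.4f}")
    logger.info(f"Max memory allocated: {stats['max_allocated']:.4f}")
    logger.info(f"Cached memory: {stats['reserved']:.4f}")
    logger.info('===========')
    return stats


stats = memory_log(logging.getLogger('memory'))
```

Calling it every few hundred batches, not only in the except branch, would show whether the allocated figure really trends upward within an epoch.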
@ptrblck Is there any particular method/function to purge the training data out of CUDA after each batch pass? Or does it get emptied by some sort of garbage collection? Or could it have anything to do with the sparse optimizer?
@ptrblck Found what is causing the error. It is the sparse optimizer. When I do not use a sparse embedding layer or optimizer, there seems to be no problem at all. Do you know of any such issues relating to the sparse optimizer?
@smth Hi Soumith, have you seen the sparse optimizer causing a CUDA error? I am guessing it is caused by params accumulating on the GPU inside the training loop. I would highly appreciate any help in the right direction. Thanks a lot!
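One thing that may be worth checking here (a guess, not a confirmed diagnosis): model.state_dict() returns detached tensors, not the Parameter objects themselves. Optimizers built from those tensors would not see the gradients that backward() writes to the real parameters, so zero_grad() would not clear them and they could keep accumulating across batches. The sketch below builds both optimizers from named_parameters() instead. TinyEmbedNet and build_optimizers are hypothetical stand-ins for SmallEmbedNN and the setup code in train.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyEmbedNet(nn.Module):
    """Hypothetical stand-in: one sparse embedding followed by a linear layer."""

    def __init__(self, num_embeddings=10, emb_dim=4):
        super().__init__()
        self.emb_layer = nn.Embedding(num_embeddings, emb_dim, sparse=True)
        self.fc = nn.Linear(emb_dim, 1)

    def forward(self, cat):
        return self.fc(self.emb_layer(cat)).squeeze(-1)


def build_optimizers(model, lr=1e-3, sparse_name='emb_layer.weight'):
    """Split the live parameters between a dense and a sparse optimizer."""
    dense_params, sparse_params = [], []
    for name, param in model.named_parameters():
        if name == sparse_name:
            sparse_params.append(param)
        else:
            dense_params.append(param)
    opt = torch.optim.Adam(dense_params, lr=lr)
    opt_sparse = torch.optim.SparseAdam(sparse_params, lr=lr)
    return opt, opt_sparse


model = TinyEmbedNet()
opt, opt_sparse = build_optimizers(model)

# state_dict() hands back a detached tensor, not the Parameter itself
detached = model.state_dict()['emb_layer.weight']
print(detached is model.emb_layer.weight, detached.requires_grad)

cat = torch.tensor([1, 2, 3])
target = torch.zeros(3)
before = model.emb_layer.weight.detach().clone()

# one training step on the live parameters
opt.zero_grad()
opt_sparse.zero_grad()
loss = (model(cat) - target).abs().mean()
loss.backward()
opt.step()
opt_sparse.step()

# zero_grad() now reaches the gradient stored on the real embedding weight
opt_sparse.zero_grad()
grad_after_zero = model.emb_layer.weight.grad
```

If the weights in the original script never change between batches, that would be a hint that the optimizers are indeed holding detached copies.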
Hi @ptrblck Thanks for the reply !
I am not sure that I am running out of memory only when I hit a specific case, since I tried various training batch sizes, updating different numbers of parameters. Besides, the error is always thrown towards the end of the 1st epoch, never at the beginning. I also observed that memory keeps increasing (not monotonically, but as a trend) when using the sparse optimizer. Anyway, I will keep investigating.
Could you post a code snippet showing this behavior?
How did you define eval_test?
Could you check that all tensors, which you are appending to a list, are properly detached from the computation graph?
def eval_test(model, test_dataloader, criterion, device):
    """Evaluation model on test set"""
    loss = []
    model.eval()
    with torch.no_grad():
        for _dict in test_dataloader:
            _dict = {key: var.to(device) for key, var in _dict.items()}
            preds = model(_dict['cont'], _dict['cat'])
            loss.append(criterion(preds, _dict['target'], yr_weights=_dict['wts']).item())
    return np.mean(loss)
I believe .item() detaches it. Also, I am using an AWS cloud instance; could this be due to a version incompatibility? I am so puzzled by this bug.
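To illustrate the .item() point, here is a small sketch with a hypothetical linear model and random data. It contrasts appending the loss tensor itself, which stays attached to its computation graph, with appending the float returned by .item().

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)          # hypothetical stand-in model
criterion = nn.L1Loss()

kept_tensors, kept_floats = [], []
for _ in range(3):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = criterion(model(x), y)
    kept_tensors.append(loss)         # still attached: keeps the graph alive
    kept_floats.append(loss.item())   # plain Python float, no graph reference

# a non-None grad_fn means the tensor still references its computation graph
attached = [t.grad_fn is not None for t in kept_tensors]
print(attached, [type(v).__name__ for v in kept_floats])
```

In eval_test the forward pass also runs under torch.no_grad(), so no graph is built there in the first place.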