Hi there,
I came across a strange GPU memory issue when training my model. GPU memory increases dramatically after loss.backward() and is never released, and then a “CUDA out of memory” error comes up. It feels like a memory leak, but I just don't know where the problem is.
I’d greatly appreciate your help. Thanks in advance!
My EC2 environment:
OS: Linux
Python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
PyTorch: 1.2.0.dev20190603
Numpy: 1.15.4
GPU: ['Tesla V100-SXM2-16GB']
CUDA Version: 9.0.176 (patches 9.0.176.1, 9.0.176.2, 9.0.176.3, 9.0.176.4)
cuDNN Version: 7.3.1
Error log
I added some CUDA memory tracking. We can see that the CUDA memory increases after the loss back-propagation step, especially the cached memory. The memory is mainly taken by the embedding layer, which embeds items into vectors. I tried to manually delete all the tensors and clear the cached memory. It seems the cached memory is successfully cleared, but something is still residing on the GPU.
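The tracking boils down to the two calls below (a minimal sketch; report_cuda_memory is just an illustrative helper name here, the actual print calls are inlined in the training loop shown further down):

import torch

def report_cuda_memory(tag):
    # memory_allocated(): bytes currently held by live tensors on the current device.
    # memory_cached(): bytes reserved by PyTorch's caching allocator (always >= allocated).
    print(f'{tag} - Cuda memory allocated: {torch.cuda.memory_allocated() / 1e9}')
    print(f'{tag} - Cuda memory cached: {torch.cuda.memory_cached() / 1e9}')

# Cleanup attempted at the end of each batch: drop the Python references to the
# batch tensors, then release the cached blocks back to the driver, i.e.
#     del logits, new_targets, loss, inputs, targets
#     torch.cuda.empty_cache()

Here is the log output: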
Now starting model fitting.
#items: 12879226
#Customers: 4407496
Model loaded
Before forward pass - Cuda memory allocated: 3.297393664
Before forward pass - Cuda memory cached: 3.300917248
After forward pass - Cuda memory allocated: 3.304946688
After forward pass - Cuda memory cached: 3.309305856
After backprop - Cuda memory allocated: 6.59643904
After backprop - Cuda memory cached: 13.207863296
After manually collecting garbage - Cuda memory allocated: 6.59593728
After manually collecting garbage - Cuda memory cached: 6.603931648
Before forward pass - Cuda memory allocated: 6.59593728
Before forward pass - Cuda memory cached: 6.603931648
After forward pass - Cuda memory allocated: 6.602330112
After forward pass - Cuda memory cached: 6.608125952
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "PBRS_Models/__main__.py", line 37, in <module>
args["func"](package_parameters)
File "PBRS_Models/src/temporal_cnn/main.py", line 169, in main
args["func"](args)
File "PBRS_Models/src/temporal_cnn/main.py", line 362, in parse_train_args
loading_procs
File "PBRS_Models/src/temporal_cnn/train.py", line 76, in main
model_fitting.main(dir_processed_data, dir_results, device, is_resume, **config_params)
File "PBRS_Models/src/temporal_cnn/model_fitting.py", line 268, in main
model, train_data, optimiser, epoch, device, **config_params
File "PBRS_Models/src/temporal_cnn/model_fitting.py", line 154, in train_epoch
loss.backward()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 15.75 GiB total capacity; 12.29 GiB already allocated; 1.19 GiB free; 9.19 MiB cached)
Code snippet
data_source is a generator used to load the data, and I am using an SGD optimiser and a CrossEntropyLoss criterion (both are created outside this function; a rough sketch of that setup is at the end of this post).
def train_epoch(model, data_source, optimiser, epoch, device, **config_params):
    """Function to train one epoch.

    :param model: The pytorch model to be trained
    :type model: torch.nn.module
    :param data_source: Training data generator
    :type data_source: generator
    :param optimiser: The pytorch optimiser used to optimise model loss function
    :type optimiser: torch.optimiser
    :param epoch: Epoch step
    :type epoch: int
    :param device: The device to load the model onto (CPU or CUDA device).
    :type device: torch.device
    :param config_params: Other configuration parameters used to control model training.
    :type config_params: dict of str->Object
    """
    model.train()
    total_loss = 0
    start_time = time.time()
    for batch_idx in range(1, data_source.num_batches + 1):
        # Calculate the loss and run the backpropagation.
        batch = next(data_source)
        batch = batch.to(device)
        inputs, targets = Variable(batch[:, :-1]), Variable(batch[:, 1:])
        optimiser.zero_grad()
        print(f'Before forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'Before forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        logits, new_targets = model(inputs, targets)
        print(f'After forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        loss = criterion(logits.view(-1, config_params["softmax_nsampled"] + 1), new_targets)
        loss.backward()
        print(f'After backprop - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After backprop - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        if config_params["clip"] > 0:
            nn.utils.clip_grad_norm_(model.parameters(), config_params["clip"])
        optimiser.step()
        total_loss += loss.item()
        del logits, new_targets, loss, inputs, targets
        torch.cuda.empty_cache()
        print(f'After manually collecting garbage - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After manually collecting garbage - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        log_interval = config_params["log_interval"]
        if batch_idx % log_interval == 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            LOGGER.info('| Epoch: {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
                        'loss {:5.2f} |'.format(
                            epoch, batch_idx, data_source.num_batches, optimiser.param_groups[0]['lr'],
                            elapsed * 1000 / log_interval, cur_loss))
            total_loss = 0
            start_time = time.time()
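For completeness, criterion and optimiser are constructed before train_epoch is called, roughly as below; the learning rate here is only a placeholder, not my real setting:

import torch
import torch.nn as nn

# `model` is the same network that train_epoch receives.
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder lr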