Ah I see. I'm trying to get model parallelism and data parallelism to work so that I can (hopefully) use multiple GPUs.
Can you please clarify about restarting? Sometimes I fill up the CUDA memory with settings that are too much for it, and I want to adjust the settings and try again. Is running torch.cuda.empty_cache()
equivalent to rebooting the machine? Or is it equivalent to closing and re-opening Python?
Thank you
Neither of them. It just returns to the OS the memory that is not actively used right now.
Restarting Python will clear everything used by PyTorch.
Restarting the OS will restart the GPU completely, hence clearing everything, even non-PyTorch-related state.
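To make the distinction concrete, here is a minimal sketch (assuming a CUDA device is available) that watches the allocator counters around empty_cache(). It illustrates that empty_cache() only hands back cached blocks that are no longer backed by live tensors:

import torch

x = torch.randn(1024, 1024, device="cuda")  # a live tensor holding ~4 MB on the GPU
print(torch.cuda.memory_allocated())        # bytes owned by live tensors
print(torch.cuda.memory_reserved())         # bytes held by PyTorch's caching allocator

torch.cuda.empty_cache()                    # frees nothing here: x still owns its memory
print(torch.cuda.memory_allocated())        # unchanged

del x                                       # tensor gone, but its block stays in the cache
torch.cuda.empty_cache()                    # now the cached block is returned to the driver
print(torch.cuda.memory_reserved())         # drops back towards 0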
Thank you @albanD. It seems like for what I'm doing (testing what my GPUs can handle without overloading them), all I need is to run torch.cuda.empty_cache()
and potentially restart Python, since I'm only using the server for PyTorch right now.
Yes, restarting Python is the right thing to do to make sure everything works fine again after a memory error.
Many thanks for your guidance @albanD!
Hi
Thanks for the great replies. In my case I have trained the model on the GPU. Now I am using the saved model in other code to check the accuracy of my trained network. I tried torch.cuda.empty_cache(),
but it is not helping.
I think this is caused by variables saved on my GPU. I am attaching my code here so you can get a better idea.
import torch

model = torch.load('Two_layer_transpose_CNN.pth')  # load the trained model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)  # move it to the GPU if one is available
torch.cuda.empty_cache()
I have trained the model to build a high-resolution image from a low-resolution image, and during testing I am getting this error.
output = []
for i_batch, sample_batched in enumerate(Data_Loader):
    #print(i_batch)
    input = sample_batched['small_image'].float().to(device)
    i = model(input).to(device)
    print(sample_batched['small_image'].shape)
    output.append(i)
This is the output with the error.
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 678, 1020])
.
.
.
torch.Size([1, 3, 678, 1020])
torch.Size([1, 3, 696, 1020])
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-7986a773d6c9> in <module>
3 #print(i_batch)
4
----> 5 input = sample_batched['small_image'].float().to(device)
6 i = model(input).to(device)
7 print(sample_batched['small_image'].shape)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.74 GiB already allocated; 294.40 KiB free; 2.78 GiB reserved in total by PyTorch)
I am stuck here.
Please help if you can. Thanks in advance.
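For anyone hitting this pattern: memory usually climbs batch by batch here because each tensor returned by the model still carries its autograd graph and stays on the GPU once appended to output. A minimal sketch of the common fix, reusing the Data_Loader, model and device names from the snippet above, is to run the loop under torch.no_grad() and move the results to the CPU:

output = []
model.eval()                       # disable dropout / batchnorm updates for inference
with torch.no_grad():              # no autograd graph is kept for these forward passes
    for i_batch, sample_batched in enumerate(Data_Loader):
        input = sample_batched['small_image'].float().to(device)
        out = model(input)
        output.append(out.cpu())   # move the result off the GPU so it does not accumulate there

With the graph disabled and the outputs kept on the CPU, the per-batch GPU usage should stay roughly flat, and empty_cache() is not needed inside the loop.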
I realised that I am having a similar issue to the one above. I am dealing with a problem where the input from each batch has a different length. Currently, I am using torch.cuda.empty_cache() to avoid OOM issues; otherwise I always get an OOM error after 20-30 epochs.
Did you finally figure out how to solve this problem?
def capture_gradients(args, model, save_dir, dataset, masking_function=None, output_hidden_states=False, loss_func=CrossEntropyLoss(), verbose=False) -> torch.Tensor:
    model.eval()
    accumulated_gradient = {}
    dataloader = DataLoader(dataset, batch_size=1, shuffle=False)
    for example in dataloader:
        monitor_all_gpus(args.logger)
        torch.cuda.empty_cache()
        outputs = model(**example, output_hidden_states=output_hidden_states)
        shift_logits = outputs.logits[..., :-1, :].contiguous()  # Get rid of the prediction for the last token, since we don't have a label for it
        shift_labels = example["input_ids"][..., 1:].contiguous()  # Get rid of the label of the first token, since no prediction is made for it
        shift_logits = shift_logits.view(-1, model.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_func(shift_logits, shift_labels)
        loss.backward()
        for name, param in model.named_parameters():  # This grabs the same params as model.parameters(); in fact, model.parameters() calls this function under the hood: https://pytorch.org/docs/stable/_modules/torch/nn/modules/module.html#Module.named_parameters
            accumulated_gradient[name] = param.grad
    torch.cuda.empty_cache()
    monitor_all_gpus(args.logger)
    return accumulated_gradient
I am trying to capture the gradients of an LLM. It is possible to fully fine-tune the LLM on my setup using the Hugging Face Trainer abstraction, so there is no issue with the amount of GPU memory I have.
However, the only way I have been able to make this function work is by adding torch.cuda.empty_cache(). Otherwise, I run out of memory after 3-4 batches. I wonder what could be causing tensors to linger in memory.
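A likely cause, judging only from the code as posted: accumulated_gradient[name] = param.grad stores a reference to the live .grad tensor, the gradients are never zeroed between examples, and outputs / loss (and the graph they hold) stay alive into the next iteration. A minimal sketch of the loop body with those references cut, keeping the same function and variable names as the post (they are not a library API), could look roughly like this:

for example in dataloader:
    model.zero_grad(set_to_none=True)  # drop the previous example's gradients before the next backward pass
    outputs = model(**example, output_hidden_states=output_hidden_states)
    shift_logits = outputs.logits[..., :-1, :].contiguous().view(-1, model.config.vocab_size)
    shift_labels = example["input_ids"][..., 1:].contiguous().view(-1).to(shift_logits.device)
    loss = loss_func(shift_logits, shift_labels)
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None:
            g = param.grad.detach().to("cpu")  # detached CPU copy: keeps no reference to GPU memory or the graph
            if name in accumulated_gradient:
                accumulated_gradient[name] += g
            else:
                accumulated_gradient[name] = g
    del outputs, loss  # let the activations and graph be freed before the next example

With the per-example graph released and the accumulated copies living on the CPU, the torch.cuda.empty_cache() calls should no longer be necessary to stay within memory.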