CUDA is out of memory

zhishen_nie · February 18, 2020, 5:09pm

Recently I have been using PyTorch to generate adversarial samples with the FGSM algorithm, just like
https://pytorch.org/tutorials/beginner/fgsm_tutorial.html . My program ran smoothly at first, but PyTorch soon told me that CUDA is out of memory. I guess some part of the GPU memory is not being released. I know about torch.cuda.empty_cache(), but it didn’t help. How can I resolve this problem? Thanks to anyone who can help me!

zhishen_nie · February 18, 2020, 5:11pm

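# t is assumed to be torch, imported as `import torch as t`
def fgsm_attack(image, epsilon, data_grad):
    # Element-wise sign of the gradient of the loss w.r.t. the input
    sign_data_grad = data_grad.sign()
    # Step the image by epsilon in the direction that increases the loss
    perturbed_image = image + epsilon*sign_data_grad
    # Keep pixel values in the valid [0, 255] range
    perturbed_image = t.clamp(perturbed_image, 0, 255)
    return perturbed_image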

def gen_adv(model, device, data, target, epsilon):
    data, target = data.to(device), target.to(device)
    data.requires_grad = True

    output = model(data)
    init_pred = output.max(1, keepdim=True)[1]
    if init_pred.item() != target.item():
        return

    loss = F.nll_loss(output, target)
    model.zero_grad()
    loss.backward()
    data_grad = data.grad.data
    data = unnormalized_show(data)

    perturbed_data = fgsm_attack(data, epsilon, data_grad)
    perturbed_data = perturbed_data.cpu().detach().numpy()
    perturbed_data = perturbed_data.reshape(3, 224, 224)
    perturbed_data = np.transpose(perturbed_data, (1, 2, 0))

    return perturbed_data

albanD · February 18, 2020, 6:02pm

Hi,

Have you tried reducing the batch size?
Also make sure that you do not save any Tensor in a list or something similar during training.
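
As a rough illustration of the second point (a toy sketch, not taken from the code in this thread): a tensor that comes out of a differentiable computation keeps a reference to its autograd graph, so appending it to a list keeps that graph and its buffers alive. Storing a plain Python number via .item() does not. The names w, kept_with_graph and kept_detached are made up for this example.

```python
import torch

w = torch.ones(3, requires_grad=True)

kept_with_graph = []  # each entry still references its autograd graph
kept_detached = []    # plain Python floats, no graph attached

for step in range(3):
    loss = (w * step).sum()
    kept_with_graph.append(loss)        # graph stays alive as long as the list does
    kept_detached.append(loss.item())   # only the numeric value is stored

# Entries in the first list carry a grad_fn; the floats in the second do not.
has_graph = [entry.grad_fn is not None for entry in kept_with_graph]
```

If the tensor itself is needed later, loss.detach() (optionally followed by .cpu()) stores the values without the graph.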

zhishen_nie · February 19, 2020, 3:33am

Well, the model has already been trained. The problem appears when I use the model to generate adversarial samples.
The error is:
Traceback (most recent call last):
  File "gen_adv.py", line 28, in <module>
    perturbed_data = Attack.gen_adv(model, device, data, target, eps)
  File "/content/drive/My Drive/Colab Notebooks/animal10/Attack.py", line 42, in gen_adv
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 7.43 GiB total capacity; 5.78 GiB already allocated; 392.94 MiB free; 6.53 GiB reserved in total by PyTorch)

It seems that loss.backward() is the key, and I don’t save any Tensor in a list or anything similar.
The model handles dozens of pictures without trouble, but the program fails once the number of pictures reaches the thousands.
Thanks for answering my question :)
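
For context on the numbers in the error message: "allocated" is memory held by tensors that are still alive, while "reserved" is everything PyTorch's caching allocator currently holds. torch.cuda.empty_cache() only returns cached blocks that no tensor is using, so it cannot free memory that something still references. A minimal sketch for logging both values between calls might look like this. The helper name cuda_memory_report is made up for illustration, and torch.cuda.memory_reserved assumes a PyTorch version that provides it (older releases called it memory_cached).

```python
import torch


def cuda_memory_report(device=0):
    """Return (allocated_mib, reserved_mib) for a CUDA device, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device) / mib  # held by live tensors
    reserved = torch.cuda.memory_reserved(device) / mib    # held by the caching allocator
    return allocated, reserved


report = cuda_memory_report()
if report is not None:
    print("allocated: %.1f MiB, reserved: %.1f MiB" % report)
```

If the allocated figure keeps climbing across iterations, some tensor is probably still referenced somewhere; if only the reserved figure is high, empty_cache() is more likely to help.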

JakeAndFinn · February 19, 2020, 4:50am

Sometimes I find that restarting your computer works.
Something may be left in memory, or a process may be holding on to memory without you knowing.
If it keeps happening, make a data loader and use mini-batches.
That may help.
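
A minimal sketch of the data loader suggestion, using random tensors in place of the real images (the shapes, the batch size of 4 and the variable names are placeholders, not values from this thread):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 small RGB "images" with integer class labels.
images = torch.rand(10, 3, 8, 8)
labels = torch.randint(0, 2, (10,))

# The loader yields mini-batches, so only one batch needs to be on the GPU at a time.
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=False)

batch_shapes = []
for batch_images, batch_labels in loader:
    # In real code, move the batch to the device here and process it.
    batch_shapes.append(tuple(batch_images.shape))
```

With 10 samples and a batch size of 4, the last batch holds the remaining 2 samples.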

zhishen_nie · February 19, 2020, 3:03pm

Well, I found the key to this problem. Each time the following function runs, the GPU memory consumed increases by 1 MB. But I really don’t know how to solve it. QAQ

def gen_adv(model, device, data, target, epsilon):
    data, target = data.to(device), target.to(device)
    data.requires_grad = True

    # 1067mb  1068mb
    output = model(data)
    # 1202mb  1203mb

    init_pred = output.max(1, keepdim=True)[1]
    if init_pred.item() != target.item():
        return
    # 1202mb  1203mb
    loss = F.nll_loss(output, target)
    # 1202mb  1203mb
    model.zero_grad()
    # 1202mb  1203mb
    loss.backward()
    # 1068mb  1069mb
    data_grad = data.grad.data
    # 1068mb  1069mb
    data = unnormalized_show(data)
    # 1068mb  1069mb

    perturbed_data = fgsm_attack(data, epsilon, data_grad)
    perturbed_data = perturbed_data.cpu().detach().numpy()

    perturbed_data = perturbed_data.reshape(3, 224, 224)
    perturbed_data = np.transpose(perturbed_data, (1, 2, 0))

    return perturbed_data
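
One variation that may be worth trying, offered as a sketch and not as a confirmed fix for the growth described above: compute the gradient with respect to the input only, using torch.autograd.grad, so that no .grad buffers are filled on the model parameters, and detach everything that leaves the function. The toy model, the [0, 1] pixel range and the helper names input_gradient and fgsm_step are assumptions made for this example. The toy model outputs logits, so log_softmax is applied before nll_loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def input_gradient(model, data, target):
    """Gradient of the loss w.r.t. the input only; parameter .grad is left untouched."""
    data = data.clone().detach().requires_grad_(True)
    loss = F.nll_loss(F.log_softmax(model(data), dim=1), target)
    (grad,) = torch.autograd.grad(loss, data)
    return grad.detach()


def fgsm_step(image, epsilon, data_grad, lo=0.0, hi=1.0):
    """One FGSM step, clamped to the valid pixel range and detached from any graph."""
    return torch.clamp(image + epsilon * data_grad.sign(), lo, hi).detach()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).to(device).eval()
x = torch.rand(1, 3, 8, 8, device=device)
y = torch.tensor([3], device=device)

readings = []  # bytes allocated after each call, recorded only when CUDA is present
for _ in range(3):
    grad = input_gradient(model, x, y)
    adv = fgsm_step(x, 0.05, grad)
    if device.type == "cuda":
        readings.append(torch.cuda.memory_allocated(device))
```

Comparing consecutive entries of readings shows whether allocated memory actually grows from call to call.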

ptrblck · February 20, 2020, 1:16am

Duplicate post from here.
Let’s please keep the discussion in your created topic.