File "d:\anaconda3\lib\site-packages\fire\core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "d:\anaconda3\lib\site-packages\fire\core.py", line 468, in _Fire
target=component.__name__)
File "d:\anaconda3\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "main.py", line 103, in train
optimizer.step()
File "d:\anaconda3\lib\site-packages\torch\autograd\grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "d:\anaconda3\lib\site-packages\torch\optim\adam.py", line 107, in step
denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
I set the batch size to 1, but I still get CUDA out of memory… I don't know what is wrong.
This is my training loop:
for epoch in range(opt.max_epoch):
    loss_mean = 0.0
    loss_val = 0.0
    loss_meter.reset()
    for ii, (data, label) in tqdm(enumerate(train_dataloader), total=len(train_data)):
        # train model
        input = Variable(data)
        input = input.float()
        target = Variable(label)
        if opt.use_gpu:
            input = input.cuda()
            target = target.cuda()
        optimizer.zero_grad()
        score = model(input)
        loss = criterion(score, target) / (400 * 190)
        loss.backward()
        optimizer.step()
The error occurs in optimizer.step() during the first epoch…
Wrapping the forward pass in a torch.no_grad() block will not store any intermediate activations, which would be needed to compute the gradients during the backward pass.
You would get an error in loss.backward(), but you are avoiding it by detaching the loss and setting requires_grad=True in:
loss = Variable(loss, requires_grad = True)
However, your training is still broken and the model will not be updated.
torch.no_grad() should only be used during evaluation and testing, when no gradients should be computed and no parameter updates are needed.
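For example, a minimal evaluation loop (just a sketch reusing the names from your training code; val_dataloader stands in for your validation DataLoader) would look like this:

model.eval()  # switch dropout/batchnorm layers to evaluation mode
with torch.no_grad():  # no graph is built and no intermediate activations are stored
    for data, label in val_dataloader:  # val_dataloader: placeholder for your validation loader
        data = data.float()
        if opt.use_gpu:
            data, label = data.cuda(), label.cuda()
        score = model(data)
        loss_val = criterion(score, label)
model.train()  # switch back before the next training epoch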
Yeah, I didn't use torch.no_grad() in training before; I only used torch.no_grad() during evaluation (model.eval()), and this error did not occur there. But this time I don't know why the error occurs during training. Even when I set the batch size to 1, the error still occurs, and I don't know why.
If you are not storing the loss directly in e.g. a list or any other tensor, which is attached to the computation graph, your model might just use too much memory.
Are you seeing the OOM issue in the first iteration(s) or later in training?
In the former case, you could try to trade compute for memory via torch.utils.checkpoint, while the latter case points towards storing tensors without detaching them.
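As a rough sketch of the checkpointing approach (block1, block2, and head are hypothetical submodules standing in for parts of your model):

from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # block1/block2/head are hypothetical names; activations of the checkpointed
    # blocks are not stored and are recomputed during the backward pass,
    # trading compute for memory.
    x = checkpoint(self.block1, x)
    x = checkpoint(self.block2, x)
    return self.head(x)

And if you do log the running loss, store loss.item() (a plain Python float) instead of the loss tensor, so the computation graph is not kept alive across iterations.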
File "D:\python\RDANET\main.py", line 197, in <module>
fire.Fire(train)
File "d:\anaconda3\lib\site-packages\fire\core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "d:\anaconda3\lib\site-packages\fire\core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "d:\anaconda3\lib\site-packages\fire\core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "D:\python\RDANET\main.py", line 112, in train
optimizer.step()
File "d:\anaconda3\lib\site-packages\torch\autograd\grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "d:\anaconda3\lib\site-packages\torch\optim\adam.py", line 91, in step
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA out of memory. Tried to allocate 1.35 GiB (GPU 0; 15.92 GiB total capacity; 12.89 GiB already allocated; 1.22 GiB free; 13.67 GiB reserved in total by PyTorch)
Is this the "storing tensors without detaching them" problem you mentioned?
Thank you very much.
And this is my model's forward code:
image_size = [400, 190], block_nums = 8, num_classes = 1000
x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
x = self.dropout(x)
x = x.view(x.size(0), 512 * 12 * 5)
x = F.leaky_relu(self.fc1(x))
x = F.leaky_relu(self.fc2(x))
x = x.reshape(x.size(0),-1, int(self.image_size[0]/2), int(self.image_size[1]/2))
# print('x:',x.size())
x = self.res1(x)
out = self.res2(x)
out = self.res3(out)
out = out + x
out = self.res4(out)
# print('res4:',out.size())
#shuffle
# print(out.shape)
out = self.shuffle(out)
# print(out.shape)
# print('shuffle:',out.size())
out = self.res5(out)
# print(out.shape)
out = np.squeeze(out)
# print(out.shape)
return out
I wrapped blocks of the model in checkpoints according to the tutorial, but it produces lots of errors.
My code:
def conv_lrelu(in_ch, out_ch, ker_sz, pad):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, ker_sz, padding=pad, bias=False),
                         nn.LeakyReLU())

def seg1(self, x):
    x = self.layer1(x)
    x = self.maxpool(x)
    return x

def fc(self, x):
    x = self.dropout(x)
    x = x.view(x.size(0), 512 * 12 * 5)
    x = F.leaky_relu(self.fc1(x))
    x = F.leaky_relu(self.fc2(x))
    x = x.view(x.size(0), -1, int(self.image_size[0]/2), int(self.image_size[1]/2))
    return x

def EDSR(self, x):
    x = self.res1(x)
    out = self.res2(x)
    out = self.res3(out)
    out = out + x
    out = self.res4(out)
    out = self.shuffle(out)
    out = self.res5(out)
    out = np.squeeze(out)
    return out
x = checkpoint(self.seg1, x)
x = checkpoint(self.seg2, x)
x = checkpoint(self.seg3, x)
x = checkpoint(self.seg4, x)
x = checkpoint(self.seg5, x)
x = checkpoint(self.fc, x)
out = checkpoint(self.EDSR, x)
In addition, no matter how I adjust the batch size, there is always a CUDA out of memory error, even though I have 32 GB of memory. The program used to run normally on this computer without any CUDA out of memory problem. I really don't know what went wrong, or why it worked well before and now it can't run.
I have two GPUs; GPU 0 has 16 GB and GPU 1 has 16 GB.
What confuses me is that the program used to run normally, but now there is always a CUDA out of memory error.
I just tested with another dataset, and the program runs normally. This is the model code for that other dataset:
input = 128*128, image_size = [160, 160], block_nums = 10, num_classes = 1000
x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
x = self.dropout(x)
x = x.view(x.size(0), 512 * 4 * 4)
x = F.leaky_relu(self.fc1(x))  # self.fc1 = nn.Linear(512 * 4 * 4, 6400)
x = F.leaky_relu(self.fc2(x))  # self.fc2 = nn.Linear(6400, 6400)
x = x.reshape(x.size(0),-1, int(self.image_size[0]/2),int(self.image_size[1]/2))
x = self.res1(x)
out = self.res2(x)
out = self.res3(out)
out = out + x
out = self.res4(out)
out = self.shuffle(out)
out = self.res5(out)
out = np.squeeze(out)
This is the code for the new dataset:
input = 160*384, image_size = [400, 190], block_nums = 8, num_classes = 1000
x = self.maxpool(F.leaky_relu(self.conv1_1(x)))
x = self.maxpool(F.leaky_relu(self.conv2_1(x)))
x = self.maxpool(F.leaky_relu(self.conv3_2(F.leaky_relu(self.conv3_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv4_2(F.leaky_relu(self.conv4_1(x)))))
x = self.maxpool(F.leaky_relu(self.conv5_2(F.leaky_relu(self.conv5_1(x)))))
x = self.dropout(x)
x = x.view(x.size(0), 512 * 12 * 5)
x = F.leaky_relu(self.fc1(x))  # self.fc1 = nn.Linear(512 * 12 * 5, 19000)
x = F.leaky_relu(self.fc2(x))  # self.fc2 = nn.Linear(19000, 19000)
x = x.reshape(x.size(0),-1, int(self.image_size[0]/2), int(self.image_size[1]/2))
x = self.res1(x)
out = self.res2(x)
out = self.res3(out)
out = out + x
out = self.res4(out)
out = self.shuffle(out)
out = self.res5(out)
out = np.squeeze(out)
I don't understand why the new dataset can't work properly. Is it because the input and output are too large? But I also reduced the batch size and it still can't run.
Yes, most likely. If neither a batch size of 1 nor checkpointing lets the model run, you could try model sharding, i.e. executing separate parts of the model on different GPUs.
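A minimal sharding sketch for your two 16 GB GPUs (ShardedModel, encoder, and edsr are hypothetical names; the split into a conv/fc part and an EDSR part is just an assumption based on the code you posted):

import torch.nn as nn

class ShardedModel(nn.Module):
    def __init__(self, encoder, edsr):
        # encoder/edsr are hypothetical stand-ins for the conv+fc part and the EDSR part
        super().__init__()
        self.encoder = encoder.to('cuda:0')  # conv/fc layers stay on the first GPU
        self.edsr = edsr.to('cuda:1')        # residual/shuffle blocks go to the second GPU

    def forward(self, x):
        x = self.encoder(x.to('cuda:0'))
        x = x.to('cuda:1')  # move the intermediate activation between GPUs
        return self.edsr(x)

The target tensor then also has to be on cuda:1 before the loss is computed; the backward pass routes the gradients back across the devices automatically.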
If this also needs too much memory, you would have to change the model architecture and make sure the memory requirement fits the available device memory.
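As a rough back-of-the-envelope check (assuming float32 weights and Adam, which keeps two extra state buffers per parameter, and ignoring biases), the two fully connected layers of the new model alone already account for most of one 16 GB GPU:

# fc1: Linear(512 * 12 * 5, 19000), fc2: Linear(19000, 19000)
fc1_params = 512 * 12 * 5 * 19000   # ~584 million parameters
fc2_params = 19000 * 19000          # ~361 million parameters
params = fc1_params + fc2_params
bytes_per_value = 4                 # float32
# weights + gradients + Adam exp_avg + Adam exp_avg_sq = 4 copies of each parameter
total_gib = params * bytes_per_value * 4 / 1024**3
print(round(total_gib, 1))          # ~14.1 GiB before any activations
# fc2's exp_avg_sq alone is 19000 * 19000 * 4 bytes ~= 1.35 GiB,
# which matches the failing allocation in your traceback.

This part of the memory usage is independent of the batch size, which is why reducing the batch size to 1 does not help; shrinking the fc layers (or the image_size they map to) would have a much larger effect.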