Recently I ran into an issue while training my model: after one batch, training fails with a CUDA out-of-memory error.
Here is the error log:
Traceback (most recent call last):
  File "main.py", line 329, in <module>
  File "main.py", line 306, in main
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 6.25 GiB (GPU 0; 10.92 GiB total capacity; 868.51 MiB already allocated; 3.57 GiB free; 5.87 GiB cached)
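One thing to keep in mind when reading this message next to my printed numbers: the error reports sizes in GiB/MiB (powers of 1024), while my print statements divide by 1e9. A minimal conversion sketch, where the helper name `to_gb` is just something I made up for illustration:

```python
GIB = 1024 ** 3  # bytes in one GiB, the unit used by the error message
MIB = 1024 ** 2  # bytes in one MiB


def to_gb(value, unit_bytes):
    """Convert a size in the given binary unit to decimal GB (bytes / 1e9)."""
    return value * unit_bytes / 1e9


# Figures copied from the RuntimeError above.
error_figures = {
    "tried to allocate": to_gb(6.25, GIB),
    "total capacity": to_gb(10.92, GIB),
    "already allocated": to_gb(868.51, MIB),
    "free": to_gb(3.57, GIB),
    "cached": to_gb(5.87, GIB),
}

for name, gb in error_figures.items():
    print(f"{name}: {gb:.3f} GB")
```

With this, the failed 6.25 GiB request comes out at roughly 6.71 in the units my prints use.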
Here is my code:
opt = torch.optim.Adam(model.parameters(), lr=Config.learning_rate, weight_decay=Config.L2)
for epoch in range(epochs):
    for i, str2var in enumerate(train_batcher):
        print("batch number:", i)
        print(f'Before forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'Before forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        e1 = str2var['e1'].cuda()
        rel = str2var['rel'].cuda()
        e2_multi = str2var['e2_multi1_binary'].float().cuda()
        # label smoothing
        e2_multi = ((1.0-Config.label_smoothing_epsilon)*e2_multi) + (1.0/e2_multi.size(1))
        pred = model.forward(e1, rel, X, adjacencies)
        print(f'After forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        loss = model.loss(pred, e2_multi)
        print(f'After backprop - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After backprop - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        train_batcher.state.loss = loss.cpu()
        del loss, pred
        print(f'After manually collecting garbage - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After manually collecting garbage - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
    print('saving to {0}'.format(model_path))
    torch.save(model.state_dict(), model_path)
I added some code to print the CUDA memory usage. Here are the results:
After forward pass - Cuda memory allocated: 0.790011392
After forward pass - Cuda memory cached: 0.847249408
After backprop - Cuda memory allocated: 0.428828672
After backprop - Cuda memory cached: 7.635730432
After manually collecting garbage - Cuda memory allocated: 0.54807808
After manually collecting garbage - Cuda memory cached: 7.216300032
batch number: 1
Before forward pass - Cuda memory allocated: 0.54807808
Before forward pass - Cuda memory cached: 7.216300032
After forward pass - Cuda memory allocated: 0.999488
After forward pass - Cuda memory cached: 7.218397184
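To make the jump easier to see, here is a rough sketch that computes the change in cached memory between the stages printed above. The readings are copied from my output for the first batch; the function names `cached_readings` and `deltas` are only illustrative.

```python
# "Cached" readings copied from the printed output above (units: bytes / 1e9).
log = """\
After forward pass - Cuda memory cached: 0.847249408
After backprop - Cuda memory cached: 7.635730432
After manually collecting garbage - Cuda memory cached: 7.216300032
"""

MARKER = " - Cuda memory cached: "


def cached_readings(text):
    """Return (stage, value) pairs for every 'cached' line in the log text."""
    readings = []
    for line in text.splitlines():
        if MARKER not in line:
            continue
        stage, value = line.split(MARKER)
        readings.append((stage, float(value)))
    return readings


def deltas(readings):
    """Return (stage, change) pairs relative to the previous reading."""
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(readings, readings[1:])
    ]


for stage, change in deltas(cached_readings(log)):
    print(f"{stage}: {change:+.3f} GB cached")
```

By this calculation the cached memory grows by about 6.79 between the "After forward pass" and "After backprop" prints, and most of it is still cached when the next batch starts.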
This issue has confused me for several days.
I don't understand what is occupying the GPU cache.
How do I fix this problem?
I'm new to CUDA.
Thanks a lot!
Here is the rest of the relevant code, the model's forward():
def forward(self, e1, rel, X, A):
    #model.forward(e1, rel, X, adjacencies)
    emb_initial = self.emb_e(X)
    x = self.gc1(emb_initial, A)
    x = self.bn3(x)
    x = F.tanh(x)
    x = F.dropout(x, Config.dropout_rate, training=self.training)
    x = self.bn4(self.gc2(x, A))
    e1_embedded_all = F.tanh(x)
    e1_embedded_all = F.dropout(e1_embedded_all, Config.dropout_rate, training=self.training)
    e1_embedded = e1_embedded_all[e1]
    rel_embedded = self.emb_rel(rel)
    stacked_inputs = torch.cat([e1_embedded, rel_embedded], 1)
    stacked_inputs = self.bn0(stacked_inputs)
    x = self.inp_drop(stacked_inputs)
    x = self.conv1(x)
    x = self.bn1(x)
    x = F.relu(x)
    x = self.feature_map_drop(x)
    x = x.view(Config.batch_size, -1)
    x = self.fc(x)
    x = self.hidden_drop(x)
    x = self.bn2(x)
    x = F.relu(x)
    x = torch.mm(x, e1_embedded_all.transpose(1, 0))
    pred = F.sigmoid(x)
    return pred