What exactly is occupying the GPU cache?

Hello all,
I recently ran into an issue while training my model: after the first batch it fails with an out-of-memory error.
Here is the error log:

Traceback (most recent call last):
  File "main.py", line 329, in <module>
    main()
  File "main.py", line 306, in main
    loss.backward()
 File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
 File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 6.25 GiB (GPU 0; 10.92 GiB total capacity; 868.51 MiB already allocated; 3.57 GiB free; 5.87 GiB cached)

Here is my code:

    opt = torch.optim.Adam(model.parameters(), lr=Config.learning_rate, weight_decay=Config.L2)
    for epoch in range(epochs):
        model.train()
        for i, str2var in enumerate(train_batcher):
            print("batch number:", i)
            #torch.cuda.empty_cache()
            opt.zero_grad()
            print(f'Before forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
            print(f'Before forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
            e1 = str2var['e1'].cuda()
            rel = str2var['rel'].cuda()
            e2_multi = str2var['e2_multi1_binary'].float().cuda()
            # label smoothing
            e2_multi = ((1.0-Config.label_smoothing_epsilon)*e2_multi) + (1.0/e2_multi.size(1))
            pred = model.forward(e1, rel, X, adjacencies)
            print(f'After forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
            print(f'After forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
            loss = model.loss(pred, e2_multi)
            loss.backward()
            print(f'After backprop - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
            print(f'After backprop - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
            opt.step()
            train_batcher.state.loss = loss.cpu()
            del loss, pred
            torch.cuda.empty_cache()
            print(f'After manually collecting garbage - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
            print(f'After manually collecting garbage - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')


        print('saving to {0}'.format(model_path))
        torch.save(model.state_dict(), model_path)

I added some code to print the CUDA memory usage; here are the results:

After forward pass - Cuda memory allocated: 0.790011392
After forward pass - Cuda memory cached: 0.847249408
After backprop - Cuda memory allocated: 0.428828672
After backprop - Cuda memory cached: 7.635730432
After manually collecting garbage - Cuda memory allocated: 0.54807808
After manually collecting garbage - Cuda memory cached: 7.216300032
batch number: 1
Before forward pass - Cuda memory allocated: 0.54807808
Before forward pass - Cuda memory cached: 7.216300032
After forward pass - Cuda memory allocated: 0.999488
After forward pass - Cuda memory cached: 7.218397184

This issue has confused me for several days.
I don't understand what is occupying the GPU cache.
How do I fix this problem?
I'm new to CUDA.
Thanks a lot!

Here is the rest of the code, in forward():

def forward(self, e1, rel, X, A):
        #model.forward(e1, rel, X, adjacencies)
        emb_initial = self.emb_e(X)
        x = self.gc1(emb_initial, A)
        x = self.bn3(x)
        x = F.tanh(x)
        x = F.dropout(x, Config.dropout_rate, training=self.training)
      
        x = self.bn4(self.gc2(x, A))
        e1_embedded_all = F.tanh(x)
        e1_embedded_all = F.dropout(e1_embedded_all, Config.dropout_rate, training=self.training)
        e1_embedded = e1_embedded_all[e1]
        rel_embedded = self.emb_rel(rel)
        stacked_inputs = torch.cat([e1_embedded, rel_embedded], 1)
        stacked_inputs = self.bn0(stacked_inputs)
        x = self.inp_drop(stacked_inputs)
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.feature_map_drop(x)
        x = x.view(Config.batch_size, -1)
        x = self.fc(x)
        x = self.hidden_drop(x)
        x = self.bn2(x)
        x = F.relu(x)
        x = torch.mm(x, e1_embedded_all.transpose(1, 0))
        pred = F.sigmoid(x)

        return pred

PyTorch uses a memory caching mechanism, which reuses already allocated device memory to avoid expensive memory allocation calls.
E.g. your training loop might use 7 GB of memory to store the model parameters, the input data, and the intermediate tensors, which are needed to calculate the gradients in the backward pass.
Once the backward pass is done and the gradients are calculated, the intermediates are no longer needed and can be freed.
However, to avoid repeated free and alloc calls, this memory is pushed to the cache and reused later.
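
As a small illustration of that behavior (a minimal sketch, assuming a single CUDA device; note that memory_cached() has been renamed to memory_reserved() in newer PyTorch releases):

import torch

def report(tag):
    # allocated = memory currently held by live tensors
    # reserved/cached = memory kept by the caching allocator for later reuse
    print(f'{tag}: allocated={torch.cuda.memory_allocated()/1e9:.3f} GB, '
          f'reserved={torch.cuda.memory_reserved()/1e9:.3f} GB')

report('start')
x = torch.randn(256, 1024, 1024, device='cuda')  # roughly 1 GB tensor
report('after allocation')
del x                        # the tensor is freed ...
report('after del')          # ... but its memory stays in the cache (reserved)
torch.cuda.empty_cache()     # return cached blocks to the CUDA driver
report('after empty_cache')

The reserved number stays high after the del, which is the same effect you are seeing in your "cached" printouts after the backward pass.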

Thanks so much for your reply!
So I got the 'CUDA out of memory' error exactly because my GPU memory is not enough?
(My GPU is a 1080 Ti.)
But I also trained my model on a 24 GB GPU and it ran into the same problem.
It really puzzles me.

Yes, this error is raised if you are running out of memory.
Could you check if the device is free before running your script?
Also, try to reduce the batch size as much as possible to check the minimal necessary memory usage.
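
You could probe that with something along these lines (a rough sketch using a stand-in nn.Linear model rather than your actual network; torch.cuda.mem_get_info() needs a reasonably recent PyTorch version):

import torch
import torch.nn as nn

device = 'cuda'
free, total = torch.cuda.mem_get_info()    # bytes free / total on the current device
print(f'free: {free/1e9:.2f} GB of {total/1e9:.2f} GB')

model = nn.Linear(4096, 4096).to(device)   # stand-in for the real model
for bs in (2048, 1024, 512, 2):
    try:
        torch.cuda.reset_peak_memory_stats()
        x = torch.randn(bs, 4096, device=device)
        model(x).sum().backward()           # forward + backward for one batch
        print(f'batch size {bs}: peak allocated '
              f'{torch.cuda.max_memory_allocated()/1e9:.2f} GB')
    except RuntimeError as e:               # a CUDA OOM surfaces as a RuntimeError
        if 'out of memory' not in str(e):
            raise
        print(f'batch size {bs}: OOM')
        torch.cuda.empty_cache()

The peak number per batch size gives you a lower bound on how much device memory a single training step needs.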

Thanks!
I checked the GPU before running my script, and it is free.
I tried reducing the batch size down to 2, and it still raises the OOM error…