How can I release the unused gpu memory?

I tried to delete unused variables and call torch.cuda.empty_cache() to release GPU memory, but I found that the used GPU memory constantly changes while its maximum value stays the same.
I built ResNet-18 in my own way, but its GPU memory usage is clearly larger than the official implementation in torchvision. So how can I find the reason?
Thanks for your attention!

    def forward(self, bottoms):
        # string -> feature_map list
        feature_pool = dict()
        bottoms = bottoms if isinstance(bottoms, list) else [bottoms]
        if self.get_device_id() >= 0:
            for idx, bottom in enumerate(bottoms):
                bottoms[idx] = bottom.cuda(device=self.get_device_id())

        for id, i_idx in enumerate(self.input_idx):
            feature_pool["bottom_{}".format(id)] = [bottoms[i_idx]]
        del bottoms

        for idx, dag_node in enumerate(self.dag_list):
            local_bottoms = []
            for x in dag_node.bottoms:
                local_bottoms.extend(feature_pool[x])
                # Drop the pooled feature if no later node depends on it.
                is_x_depended = any(x in dag.bottoms for dag in self.dag_list[(idx + 1):])
                if (not is_x_depended) and (x not in self.top_names):
                    del feature_pool[x]
            local_tops = self._modules[dag_node.scope].forward(local_bottoms)

            assert len(dag_node.tops) == len(local_tops)
            for i in range(len(dag_node.tops)):
                feature_pool[dag_node.tops[i]] = [local_tops[i]]
            del local_tops, local_bottoms

        feature_list = []
        for name in self.top_names:
            feature_list.extend(feature_pool[name])
            del feature_pool[name]
            torch.cuda.empty_cache()

        return feature_list

To release the memory, you would have to make sure that all references to the tensor are deleted and call torch.cuda.empty_cache() afterwards.
E.g. del bottoms only deletes the local bottoms reference, while the tensor itself stays alive as long as another (e.g. global) reference still points to it.

Also, note that torch.cuda.empty_cache() will not avoid out of memory issues, since the cache is reused, not lost.
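To illustrate that point, here is a minimal sketch (a hypothetical helper, not code from this thread; it needs a CUDA device and returns None otherwise) showing that empty_cache() only returns cached, unused blocks to the driver, while memory still referenced by a live tensor stays allocated:

```python
import torch

def cache_vs_allocated():
    # Sketch: empty_cache() releases cached (unused) blocks back to the
    # driver; memory still referenced by a live tensor stays allocated.
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU
    x = torch.randn(1024 * 1024, device="cuda")  # ~4 MB allocation
    torch.cuda.empty_cache()
    # x is still referenced, so its allocation survives the cache flush
    return torch.cuda.memory_allocated() > 0
```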


I delete local_bottoms after using it as the input of the forward function.

Could you please read my code and point out the reason for the high GPU memory usage? I have spent a lot of time on this, but it doesn't work.

Removing the local reference will not delete the global tensor.
If you cannot free the cache, then a reference is still pointing to the tensor as shown here:

def fun(tensor):
    print(torch.cuda.memory_allocated() / 1024**2)
    # Delete local reference
    del tensor
    print(torch.cuda.memory_allocated() / 1024**2)

# Check that memory is empty
print(torch.cuda.memory_allocated() / 1024**2)
> 0.0
print(torch.cuda.memory_cached() / 1024**2)
> 0.0

# Create tensor
x = torch.randn(1024 * 1024, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Call fun and check if x is still alive
fun(x)
> 4.0
> 4.0

print(x.device)  # still alive
> cuda:0
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Delete global tensor
del x
print(torch.cuda.memory_allocated() / 1024**2)
> 0.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Now empty cache
torch.cuda.empty_cache()
print(torch.cuda.memory_cached() / 1024**2)
> 0.0

Thanks for replying. I understand your explanation. But I think "del bottoms" not working is not the reason for the high GPU memory usage.

If I understand your issue correctly, you are trying to empty the cache, which doesn't seem to be working, right?
If that’s the case, you would have to delete all references to the tensors you would like to delete so that the cache can be emptied.

If I misunderstood it, please correct me.

Yes, I tried to avoid using temporary variables and to delete unused variables. In the forward function of each module, I delete the other feature_map tensors before returning the result.

Did you check if the cached memory decreased using this approach?

If I delete tensors and use empty_cache, the used GPU memory constantly changes, but the maximum GPU memory usage does not change.

You won't avoid the max. memory usage by removing the cache.
As explained before, torch.cuda.empty_cache() will only release the cache, so that PyTorch will have to reallocate the necessary memory, which might slow down your code.
The peak memory usage will stay the same, i.e. if your training has a peak memory usage of 12GB, it will stay at this value.
You will only temporarily reduce the allocated memory, which will then be reallocated if necessary.
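This can be sketched with a small hypothetical helper (assuming a recent PyTorch with torch.cuda.reset_peak_memory_stats() and a CUDA device; it returns None otherwise): the peak statistic reported by torch.cuda.max_memory_allocated() is a high-water mark and is unaffected by freeing tensors or emptying the cache:

```python
import torch

def peak_survives_empty_cache():
    # Sketch: the recorded peak allocation is a high-water mark and is
    # not lowered by deleting tensors or emptying the cache.
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1024 * 1024, device="cuda")  # ~4 MB peak
    peak_before = torch.cuda.max_memory_allocated()
    del x
    torch.cuda.empty_cache()  # frees the cache, not the statistic
    peak_after = torch.cuda.max_memory_allocated()
    return peak_before == peak_after
```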


I agree with your opinion. But when I define the CNN model with the code shown in the question, the peak memory usage is 4 times larger than the official ResNet model in torchvision.

I would recommend adding debug statements using print(torch.cuda.max_memory_allocated()) to try to narrow down which operations are wasting the memory.
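One way to do that, sketched here with a hypothetical helper and a toy nn.Sequential model (not the model from the question), is to print the running peak after each submodule so the expensive step stands out; the prints are simply skipped on CPU-only machines:

```python
import torch
import torch.nn as nn

def report_peak_per_layer(model, x):
    # Hypothetical debugging helper: run each child module in turn and
    # print the running peak allocation after each step.
    for name, layer in model.named_children():
        x = layer(x)
        if torch.cuda.is_available():
            mb = torch.cuda.max_memory_allocated() / 1024**2
            print("{}: {:.1f} MB".format(name, mb))
    return x

# Toy model just to show the instrumentation pattern
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
out = report_peak_per_layer(model, torch.randn(8, 64))
```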

Just by skimming through the code, it seems that some lists and dicts are temporarily used and freed later. This might increase the peak memory, e.g. if you are storing the complete feature maps first and delete them one by one later.
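To make that concrete, here is a toy comparison (hypothetical functions, not the code from the question): keeping every intermediate feature map in a list holds them all alive at once, while overwriting a single variable lets each previous activation be freed as soon as the next one is computed. Both return the same result:

```python
import torch
import torch.nn as nn

layers = [nn.Linear(32, 32) for _ in range(3)]

def run_keep_all(layers, x):
    # Every intermediate stays referenced until the function returns,
    # so peak memory grows with the number of layers.
    feats = []
    for layer in layers:
        x = layer(x)
        feats.append(x)
    return feats[-1]

def run_free_eagerly(layers, x):
    # Only the current activation is referenced; the previous one
    # becomes collectable right after each assignment.
    for layer in layers:
        x = layer(x)
    return x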

When I delete tensors and use empty_cache, the memory usage decreases only when the one-batch training process is done, rather than at the point where I call torch.cuda.empty_cache().

Dear ptrblck, I found the GPU memory decreases from 5390M to 256M if I use empty_cache. The decrease happens when the model runs inference on the validation dataset.

That might be expected, and PyTorch will reallocate the memory if needed.
You can clear the cache, but you won't be able to reduce the peak memory, and might just slow down the code by doing so.

If your custom ResNet implementation uses more memory than the torchvision implementation, I would still recommend comparing both implementations by adding the mentioned print statements and narrowing down which part of your code uses more memory.
