How can I release the unused gpu memory?

I tried to delete unused variables and call torch.cuda.empty_cache() to release GPU memory, but I found that the used GPU memory constantly changes while its maximum value stays the same.
I built ResNet-18 in my own way, but its GPU memory usage is clearly larger than the official implementation in torchvision. So how can I find the reason?
Thanks for your attention!

    def forward(self, bottoms):
        # string -> feature_map list
        feature_pool = dict()
        bottoms = bottoms if isinstance(bottoms, list) else [bottoms]
        if self.get_device_id() >= 0:
            for idx, bottom in enumerate(bottoms):
                bottoms[idx] = bottom.cuda(device=self.get_device_id())

        for id, i_idx in enumerate(self.input_idx):
            feature_pool["bottom_{}".format(id)] = [bottoms[i_idx]]
        del bottoms

        for idx, dag_node in enumerate(self.dag_list):
            local_bottoms = []
            for x in dag_node.bottoms:
                local_bottoms.extend(feature_pool[x])
                # Drop the pooled feature if no later node depends on it.
                is_x_depended = any(x in dag.bottoms for dag in self.dag_list[(idx + 1):])
                if (not is_x_depended) and (x not in self.top_names):
                    del feature_pool[x]
            local_tops = self._modules[dag_node.scope].forward(local_bottoms)

            assert len(dag_node.tops) == len(local_tops)
            for i in range(len(dag_node.tops)):
                feature_pool[dag_node.tops[i]] = [local_tops[i]]
            del local_tops, local_bottoms

        feature_list = []
        for name in self.top_names:
            feature_list.extend(feature_pool[name])
            del feature_pool[name]
            torch.cuda.empty_cache()

        return feature_list

To release the memory, you would have to make sure that all references to the tensor are deleted and call torch.cuda.empty_cache() afterwards.
E.g. del bottoms only deletes the local bottoms reference, while the tensor itself stays alive as long as another (e.g. global) reference still points to it.

Also, note that torch.cuda.empty_cache() will not avoid out of memory issues, since the cache is reused, not lost.
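To illustrate that point, here is a minimal sketch (a hypothetical helper, not code from this thread; it needs a CUDA device and returns None otherwise) showing that empty_cache() only returns cached, unused blocks to the driver, while memory still referenced by a live tensor stays allocated:

```python
import torch

def cache_vs_allocated():
    # Sketch: empty_cache() releases cached (unused) blocks back to the
    # driver; memory still referenced by a live tensor stays allocated.
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a GPU
    x = torch.randn(1024 * 1024, device="cuda")  # ~4 MB allocation
    torch.cuda.empty_cache()
    # x is still referenced, so its allocation survives the cache flush
    return torch.cuda.memory_allocated() > 0
```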


I delete local_bottoms after using it as the input of the forward function.

Could you please read my code and point out the reason for the high GPU memory usage? I have spent a lot of time on this, but it doesn't work.

Removing the local reference will not delete the global tensor.
If you cannot free the cache, then a reference is still pointing to the tensor as shown here:

def fun(tensor):
    print(torch.cuda.memory_allocated() / 1024**2)
    # Delete local reference
    del tensor
    print(torch.cuda.memory_allocated() / 1024**2)

# Check that memory is empty
print(torch.cuda.memory_allocated() / 1024**2)
> 0.0
print(torch.cuda.memory_cached() / 1024**2)
> 0.0

# Create tensor
x = torch.randn(1024 * 1024, device='cuda')
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Call fun and check if x is still alive
fun(x)
> 4.0
> 4.0

print(x.device)  # still alive
> cuda:0
print(torch.cuda.memory_allocated() / 1024**2)
> 4.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Delete global tensor
del x
print(torch.cuda.memory_allocated() / 1024**2)
> 0.0
print(torch.cuda.memory_cached() / 1024**2)
> 20.0

# Now empty cache
torch.cuda.empty_cache()
print(torch.cuda.memory_cached() / 1024**2)
> 0.0

Thanks for replying. I understand your explanation. But I think "del bottoms" not working is not the reason for the high GPU memory usage.

If I understand your issue correctly, you are trying to empty the cache, which doesn't seem to be working, right?
If that’s the case, you would have to delete all references to the tensors you would like to delete so that the cache can be emptied.

If I misunderstood it, please correct me.

Yes, I tried to avoid using temporary variables and to delete unused variables. In the forward function of each module, I delete the other feature_map tensors before returning the result.

Did you check if the cached memory decreased using this approach?

If I delete tensors and use empty_cache, the used GPU memory constantly changes, but the maximum GPU memory usage does not change.

You won't avoid the max. memory usage by removing the cache.
As explained before, torch.cuda.empty_cache() will only release the cache, so that PyTorch will have to reallocate the necessary memory, which might slow down your code.
The peak memory usage will stay the same, i.e. if your training has a peak memory usage of 12GB, it will stay at this value.
You will only temporarily reduce the allocated memory, which will then be reallocated if necessary.
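This can be sketched with a small hypothetical helper (assuming a recent PyTorch with torch.cuda.reset_peak_memory_stats() and a CUDA device; it returns None otherwise): the peak statistic reported by torch.cuda.max_memory_allocated() is a high-water mark and is unaffected by freeing tensors or emptying the cache:

```python
import torch

def peak_survives_empty_cache():
    # Sketch: the recorded peak allocation is a high-water mark and is
    # not lowered by deleting tensors or emptying the cache.
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1024 * 1024, device="cuda")  # ~4 MB peak
    peak_before = torch.cuda.max_memory_allocated()
    del x
    torch.cuda.empty_cache()  # frees the cache, not the statistic
    peak_after = torch.cuda.max_memory_allocated()
    return peak_before == peak_after
```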


I agree with your opinion. But when I define the CNN model with the code shown in the question, the peak memory usage is 4 times larger than the official ResNet model in torchvision.

I would recommend adding debug statements using print(torch.cuda.max_memory_allocated()) to try to narrow down which operations are wasting the memory.
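One way to do that, sketched here with a hypothetical helper and a toy nn.Sequential model (not the model from the question), is to print the running peak after each submodule so the expensive step stands out; the prints are simply skipped on CPU-only machines:

```python
import torch
import torch.nn as nn

def report_peak_per_layer(model, x):
    # Hypothetical debugging helper: run each child module in turn and
    # print the running peak allocation after each step.
    for name, layer in model.named_children():
        x = layer(x)
        if torch.cuda.is_available():
            mb = torch.cuda.max_memory_allocated() / 1024**2
            print("{}: {:.1f} MB".format(name, mb))
    return x

# Toy model just to show the instrumentation pattern
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
out = report_peak_per_layer(model, torch.randn(8, 64))
```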

Just by skimming through the code, it seems that some lists and dicts are temporarily used and freed later. This might increase the peak memory, e.g. if you are storing the complete feature maps first and delete them one by one later.
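To make that concrete, here is a toy comparison (hypothetical functions, not the code from the question): keeping every intermediate feature map in a list holds them all alive at once, while overwriting a single variable lets each previous activation be freed as soon as the next one is computed. Both return the same result:

```python
import torch
import torch.nn as nn

layers = [nn.Linear(32, 32) for _ in range(3)]

def run_keep_all(layers, x):
    # Every intermediate stays referenced until the function returns,
    # so peak memory grows with the number of layers.
    feats = []
    for layer in layers:
        x = layer(x)
        feats.append(x)
    return feats[-1]

def run_free_eagerly(layers, x):
    # Only the current activation is referenced; the previous one
    # becomes collectable right after each assignment.
    for layer in layers:
        x = layer(x)
    return x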

When I delete tensors and use empty_cache, the memory usage decreases only when the one-batch training process is done, rather than at the point where I call torch.cuda.empty_cache().

Dear ptrblck, I found the GPU memory decreases from 5390M to 256M if I use empty_cache. The decrease happens when the model runs inference on the validation dataset.

That might be expected, and PyTorch will reallocate the memory if needed.
You can clear the cache, but you won't be able to reduce the peak memory, and might just slow down the code by doing so.

If your custom ResNet implementation uses more memory than the torchvision implementation, I would still recommend comparing both implementations by adding the mentioned print statements and narrowing down which part of your code uses more memory.
