Creating tensors on CPU and measuring the memory consumption?

Let’s say that I have a PyTorch tensor that I’m loading onto the CPU. I would now like to experiment with how different shapes affect memory consumption, and I thought the best way to do this is to create a simple random tensor and then measure the memory consumption for different shapes. However, while attempting this, I noticed anomalies, so I decided to simplify the task further.

I’m creating a 3 GB PyTorch tensor and want to measure its memory consumption with the psutil module. To get some statistics, I do this ten times in a for loop and report the mean and standard deviation. I also move the tensor to the GPU and then use PyTorch functions to measure the allocated and reserved memory on the GPU.

My code is this:

import torch
import psutil

if __name__ == '__main__':
    resident_memories = []
    for i in range(10):
        x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8)
        resident_memory = psutil.Process().memory_info().rss/1024**2
        resident_memories.append(resident_memory)
        del x, resident_memory
    print('Average resident memory [MB]: {} +/- {}'.format(
        torch.mean(torch.tensor(resident_memories)), torch.std(torch.tensor(resident_memories))))
    del resident_memories

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    alloc_memories = []
    reserved_memories = []
    for i in range(10):
        x = torch.ones((3, 1024, 1024, 1024), dtype=torch.uint8).to(device)
        alloc_memory = torch.cuda.memory_allocated(device=device)/1024**2
        reserved_memory = torch.cuda.memory_reserved(device=device)/1024**2
        alloc_memories.append(alloc_memory)
        reserved_memories.append(reserved_memory)
        del x, alloc_memory, reserved_memory
    print('By tensors occupied memory on GPU [MB]: {} +/- {}\nCurrent GPU memory managed by caching allocator [MB]: {} +/- {}'.format(
          torch.mean(torch.tensor(alloc_memories)), torch.std(torch.tensor(alloc_memories)),
          torch.mean(torch.tensor(reserved_memories)), torch.std(torch.tensor(reserved_memories))))

I obtain the following output:

Average resident memory [MB]: 4028.602783203125 +/- 0.06685283780097961
By tensors occupied memory on GPU [MB]: 3072.0 +/- 0.0
Current GPU memory managed by caching allocator [MB]: 3072.0 +/- 0.0
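As an aside, besides the current counters used above, the caching allocator also tracks peak statistics, which can be more robust when tensors are created and freed inside a loop, since the peak survives a `del`. A small sketch (guarded so it only does something when a GPU is available):

```python
import torch

if torch.cuda.is_available():
    device = torch.device('cuda:0')
    # Reset the peak counters before the allocation under test
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.ones((3, 1024, 1024), dtype=torch.uint8, device=device)
    del x
    # The peak is retained even after the tensor is deleted,
    # unlike memory_allocated(), which drops back down
    print('Peak allocated [MB]:', torch.cuda.max_memory_allocated(device) / 1024**2)
    print('Currently allocated [MB]:', torch.cuda.memory_allocated(device) / 1024**2)
```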

I’m executing this code on a cluster, but I also ran the first part in the cloud and mostly observed the same behavior. When I ran this on the cluster, it was the only job on the CPU node, so other jobs should (hopefully) not affect the memory measurement.

I’d have two quick questions:

i) Why is psutil.Process().memory_info().rss inaccurate when measuring the memory of a 3 GB tensor?

ii) How can we (correctly) measure the memory of tensors on CPU? One use case might be that the tensor is so huge that moving it onto a GPU might cause a memory error.
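For what it’s worth, the raw payload of a tensor (independent of process-level accounting like `rss`) can at least be computed from its own metadata: number of elements times bytes per element. A minimal sketch, here using a smaller 3 MB tensor of the same dtype so nothing huge gets allocated (the arithmetic is identical for the 3 GB case):

```python
import torch

# A 3 MB uint8 tensor (one byte per element)
x = torch.ones((3, 1024, 1024), dtype=torch.uint8)

# Exact size of the tensor's data buffer in bytes
size_bytes = x.nelement() * x.element_size()
print('Tensor data [MB]: {}'.format(size_bytes / 1024**2))  # 3.0
```

This only accounts for the data buffer itself, not for any allocator or interpreter overhead, which is exactly the part that `rss` additionally picks up.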

@ptrblck Do you think that you could help, please?

I’m not familiar enough with the Python memory management and garbage collection mechanism to be able to explain the effect you are seeing properly. Since rss returns the “non-swapped physical memory” of the process, you would not only see the required memory used by the tensor allocation but also the memory usage by the process itself, loaded libraries, as well as all other objects.
My guess would be that the garbage collection kicks in at different intervals and might free some memory, so you could check if playing around with its thresholds might change the behavior.
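For example, the collector’s generation thresholds can be inspected and tuned via the gc module (the values below are purely illustrative):

```python
import gc

# Thresholds for the three generations; typically (700, 10, 10)
print(gc.get_threshold())

# Raise the generation-0 threshold so collections run less often
gc.set_threshold(10000, 10, 10)
print(gc.get_threshold())  # (10000, 10, 10)

# Or disable automatic collection entirely
gc.disable()
print(gc.isenabled())  # False
```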

Thanks for your reply! I want to be a bit cautious, since I’m inexperienced when it comes to Python/PyTorch memory management, but I don’t think it’s the garbage collection mechanism that is responsible for the effects we’ve seen so far. I think I found the underlying issue. Let me explain:

  • When you wrote that I should play around with the thresholds of the garbage collection, I thought it might be best to completely turn it off. So if the optional garbage collection is enabled, I write at the beginning of my script:
import gc

if __name__ == '__main__':
    if gc.isenabled():
        gc.disable()
However, after some tests, I saw no huge difference from the initially reported results.

  • What I found is that the statement import torch is quite expensive in terms of memory. For this, I used the following code:
import gc
import torch 
import psutil

if __name__ == '__main__': 
    print('Is Pythons optional garbage collector enabled? Answer: {}'.format(gc.isenabled()))
    if gc.isenabled():
        gc.disable()
        print('Pythons optional garbage collector was disabled.')
    resident_memory = psutil.Process().memory_info().rss/1024**2
    print('\nResident memory [MB]: {}'.format(resident_memory))

If I execute this code on an RTX2080Ti with Cuda10.1, I consistently get the following output for the resident memory (this time I’m not reporting any error bars, but I made sure that the results are not orders of magnitude apart between runs):

Resident memory [MB]: 190.37109375

However, the same code executed on an RTX2080Ti with Cuda11.0 yields:

Resident memory [MB]: 981.79296875

For both Cuda versions, I used Python 3.6.9, and for Cuda10.1, PyTorch 1.7.1+cu101, whereas for Cuda11.0, I had PyTorch 1.7.1 available.
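To isolate the import cost itself, one can also snapshot the RSS before and after the import within a single process. An illustrative sketch (not the exact measurement above):

```python
import psutil

proc = psutil.Process()
rss_before = proc.memory_info().rss / 1024**2

import torch  # the allocation under test

rss_after = proc.memory_info().rss / 1024**2
print('RSS before import torch [MB]: {:.1f}'.format(rss_before))
print('RSS after  import torch [MB]: {:.1f}'.format(rss_after))
print('Import cost [MB]: {:.1f}'.format(rss_after - rss_before))
```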

My question would be: Why is there such a huge discrepancy of the resident memory when using different Cuda versions? (The results that I had reported in my initial post were on Cuda version 11.3 if I’m not wrong.)

I suspected the gc because of the change in the reported memory while using the same code snippet, but again I don’t know what additional bookkeeping is done, etc. You might want to take a look at this issue, which discusses the memory overhead, its increase, etc.