Hi,
I’ve been trying to run copies of my model on multiple GPUs on a local machine.
When I run a loop that moves the model to each GPU device, CPU memory keeps increasing, eventually leading to an out-of-memory exception later during training. Tracing it back got me to this point.
The problem is not the CUDA context itself: I initialize a tensor on CUDA beforehand, and that spike is already accounted for in the numbers below.
It looks like a duplicate of the entire model stays in CPU memory for every instance moved to another GPU.
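One check I could still add (just a sketch, not reflected in the numbers below) is whether any parameters or buffers of a moved copy still report a CPU device:

```python
def cpu_tensor_mb(module):
    # MB of parameters/buffers that still report a CPU device.
    total = 0
    for t in list(module.parameters()) + list(module.buffers()):
        if t.device.type == "cpu":
            total += t.nelement() * t.element_size()
    return total / 1024 ** 2
```

If that comes back as 0 MB for every moved copy, whatever is holding the host RAM isn't visible through the module's own tensors.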
Any thoughts? I've spent an entire day trying to work around it, and I had already stepped away from DDP for the same reason (I figured it might be multiple CUDA contexts).
print("Prior usage", int(psutil.virtual_memory().used) / 1024 ** 2)
torch.zeros(100).to("cuda")
print("Cuda init", int(psutil.virtual_memory().used) / 1024 ** 2)
model = load_model()
print("Loaded model cpu", int(psutil.virtual_memory().used) / 1024 ** 2)
param_size = 0
for param in model.parameters():
param_size += param.nelement() * param.element_size()
buffer_size = 0
for buffer in model.buffers():
buffer_size += buffer.nelement() * buffer.element_size()
size_all_mb = (param_size + buffer_size) / 1024 ** 2
print('model size: {:.3f}MB'.format(size_all_mb))
models = []
for rank in range(6):
models.append(model.to(rank))
print("Moved model to rank", rank, int(psutil.virtual_memory().used) / 1024 ** 2)
Output:

```
Prior usage 3148.53515625
Cuda init 4778.109375
Loaded model cpu 5472.44921875
model size: 659.675MB
Moved model to rank 0 5473.4453125
Moved model to rank 1 5540.5859375
Moved model to rank 2 5936.28125
Moved model to rank 3 6521.75390625
Moved model to rank 4 7109.11328125
Moved model to rank 5 7694.27734375
```
You can see that for the first few GPUs host RAM barely changes, but after that each additional device adds an amount of RAM close to the model size.
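To separate the host memory taken by each per-device CUDA context from the model copies themselves, I could also touch every device before loading the model. A rough sketch of that variant (same `load_model()` helper as above, device count hard-coded to my 6 GPUs):

```python
import psutil
import torch

def used_mb():
    # Host RAM currently in use, in MB
    return psutil.virtual_memory().used / 1024 ** 2

print("Prior usage", used_mb())

# Create every per-device CUDA context up front, before the model
# is moved anywhere, so its host-memory cost shows up separately.
for rank in range(6):
    torch.zeros(1, device=f"cuda:{rank}")
    print(f"Context on cuda:{rank}", used_mb())

model = load_model()  # same helper as in the script above
print("Loaded model cpu", used_mb())

models = []
for rank in range(6):
    models.append(model.to(rank))
    print(f"Moved model to rank {rank}", used_mb())
```

If the per-rank RAM increase still shows up after the contexts are already created, then it really is tied to moving the model and not to context initialization.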