CPU memory usage increases with each model.to(device)

Hi,

I’ve been trying to run copies of my model on multiple GPUs on a local machine.
While running a loop that moves the model to each GPU, CPU memory keeps increasing, eventually leading to an out-of-memory exception. (That happens later during training; tracing it back got me to this point.)

The problem is not the CUDA context: I initialized a tensor on CUDA beforehand, and that spike is already accounted for.

It looks like there remains a duplicate of the entire model on the CPU for every instance on another GPU.
Any thoughts? I’ve spent an entire day trying to work around it, and I’ve already stepped away from DDP for the same reason. (I figured it might be caused by multiple CUDA contexts.)

    import psutil
    import torch

    print("Prior usage", int(psutil.virtual_memory().used) / 1024 ** 2)
    torch.zeros(100).to("cuda")
    print("Cuda init", int(psutil.virtual_memory().used) / 1024 ** 2)
    model = load_model()
    print("Loaded model cpu", int(psutil.virtual_memory().used) / 1024 ** 2)

    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()
    buffer_size = 0
    for buffer in model.buffers():
        buffer_size += buffer.nelement() * buffer.element_size()
    size_all_mb = (param_size + buffer_size) / 1024 ** 2
    print('model size: {:.3f}MB'.format(size_all_mb))

    models = []
    for rank in range(6):
        models.append(model.to(rank))
        print("Moved model to rank", rank, int(psutil.virtual_memory().used) / 1024 ** 2)

Output (CPU RAM in MB):

    Prior usage 3148.53515625
    Cuda init 4778.109375
    Loaded model cpu 5472.44921875
    model size: 659.675MB
    Moved model to rank 0 5473.4453125
    Moved model to rank 1 5540.5859375
    Moved model to rank 2 5936.28125
    Moved model to rank 3 6521.75390625
    Moved model to rank 4 7109.11328125
    Moved model to rank 5 7694.27734375

You can see that the first few GPUs add little, but from rank 3 onward each move increases RAM by roughly 585 MB, which is close to the model size.
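
The per-step increases can be computed directly from the log. A small sketch using only the values printed in the output of the first run (the variable names are just for illustration):

```python
# CPU RAM readings (MB) copied from the log output of the first run.
readings = [
    ("Loaded model cpu", 5472.44921875),
    ("rank 0", 5473.4453125),
    ("rank 1", 5540.5859375),
    ("rank 2", 5936.28125),
    ("rank 3", 6521.75390625),
    ("rank 4", 7109.11328125),
    ("rank 5", 7694.27734375),
]
model_size_mb = 659.675  # reported model size

# Difference between each reading and the one before it.
deltas = {}
for (_, prev), (label, cur) in zip(readings, readings[1:]):
    deltas[label] = cur - prev
    share = deltas[label] / model_size_mb
    print(f"{label}: +{deltas[label]:.1f} MB ({share:.0%} of model size)")
```

Since psutil.virtual_memory().used is system-wide, these deltas also include anything other processes allocated in the meantime.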

Sending random tiny tensor copies to all GPUs still increases RAM usage by the same ~600 MB.

It seems like there is additional CUDA initialization overhead for every GPU.
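
One way to check this would be to touch each GPU with a tiny tensor and record the process's memory after each one. A rough sketch, assuming psutil is installed; rss_mb and context_overhead are hypothetical helper names, and it reads the process RSS rather than the system-wide virtual_memory().used so that other processes do not skew the numbers:

```python
import os

import psutil
import torch


def rss_mb() -> float:
    """Resident set size of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2


def context_overhead() -> list:
    """Touch every visible GPU once and record the RSS growth per device."""
    results = []
    for idx in range(torch.cuda.device_count()):
        before = rss_mb()
        # Allocating a tiny tensor forces CUDA initialization on this device.
        _ = torch.zeros(1, device=f"cuda:{idx}")
        torch.cuda.synchronize(idx)
        results.append((idx, rss_mb() - before))
    return results


for idx, delta in context_overhead():
    print(f"cuda:{idx}: +{delta:.1f} MB CPU RAM")
```

On a machine without CUDA the loop simply does nothing, since torch.cuda.device_count() returns 0.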

However, after sending tensors to all GPUs, sending models to the GPUs still increases CPU RAM, but only by ~400 MB each.

Btw, I changed the loop to actually copy the model to each GPU (this needs import copy):

    models.append(copy.deepcopy(model).cuda(torch.device(rank)))
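
The copy matters because nn.Module.to() moves the parameters in place and returns the same module object, so a loop that appends model.to(rank) ends up holding one model several times rather than several copies. A minimal sketch of the difference, using a small nn.Linear as a stand-in for the real model and a dtype change instead of a device move so that it runs without a GPU:

```python
import copy

import torch
from torch import nn

# Stand-in for the real model (assumption: any nn.Module behaves the same way).
model = nn.Linear(4, 2)

# Module.to() converts the module in place and returns the module itself.
moved = model.to(torch.float64)
print("same object:", moved is model)
print("original dtype after to():", model.weight.dtype)

# Copy first, then convert: the original module stays untouched.
fresh = nn.Linear(4, 2)
clone = copy.deepcopy(fresh).to(torch.float64)
print("same object:", clone is fresh)
print("original dtype after deepcopy().to():", fresh.weight.dtype)
```

Note that the deep copy itself is made on the CPU before being moved, so it temporarily needs CPU memory for a second copy of the model.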

Is this expected behaviour?
I can’t keep trying to debug if I don’t know if it’s a bug. It’s costing me a lot of time.
@ptrblck (I’ve seen a mod be tagged before so I assume it’s ok)

I’m assuming it’s just expected behavior for large models.

It seems unexpected that CPU memory is increasing. Could you please file an issue on the pytorch/pytorch GitHub repository so our CUDA experts can take a look? Thank you!

It seems to come from using Hugging Face models, which should implement nn.Module.
Using my own nn.Module does not increase usage any further.