Map_location() takes multiple devices

I save a model from cuda:0 (model1) and load it onto cuda:1 (model2). After I delete the tensors of model1, cuda:0 is still occupied. Why is that? Here is the snippet:

import torch
import torch.nn as nn
import torch.optim as optim

# Define model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        x = self.fc(x)
        return x

def torch_save(model, optimizer):{'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'epoch': 0
                }, '')  # the checkpoint path was left empty in the original post

def torch_load(optimizer):
    model = TheModelClass()

    checkpoint = torch.load('', map_location={'cuda:0': 'cuda:1'})
    model.load_state_dict(checkpoint['model_state_dict'])
    epoch = checkpoint['epoch']

    print(f'epoch = {epoch}')

    return model

model = TheModelClass()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model1 ='cuda:0')
print(f'Device of model = {model.fc.weight.device}')

torch_save(model1, optimizer)
model2 = torch_load(optimizer)
del model, model1

# Expected only cuda:1 to hold tensors,
# but both cuda:0 and cuda:1 still hold tensors.
# This is weird.

The first CUDA operation on a device creates the CUDA context (which contains the PyTorch kernels, cuDNN, NCCL, etc.). While you can delete all tensors, you cannot release the CUDA context during the lifetime of the script.
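A minimal sketch of this behavior (guarded so it also runs on CPU-only machines): the very first operation on cuda:0 brings up the context, and deleting the tensor afterwards does not take it down.

```python
import torch

if torch.cuda.is_available():
    x = torch.ones(1, device='cuda:0')  # this first op creates the cuda:0 context
    del x                               # the tensor is gone ...
    # ... but `nvidia-smi` still shows several hundred MB on cuda:0:
    # that is the context (kernel images, cuDNN/NCCL handles), and it
    # stays resident until the process exits.
else:
    print('No CUDA device available; the context only appears on a GPU.')
```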


Hi, thank you for your kind reply. I thought torch.cuda.empty_cache() would release the context. But I found something even weirder: if I load only model2 (without ever creating model1), both GPUs are still used.

torch.cuda.empty_cache() will only return reserved-but-unused memory, which PyTorch's custom caching allocator holds on to in order to avoid expensive memory allocations. The CUDA context won't be freed.
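A sketch of the distinction (it needs a GPU to show anything): deleting a tensor leaves its memory in the allocator's cache as "reserved"; empty_cache() hands that cache back to the driver, but the context itself survives.

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda:0')
    del x
    reserved_before = torch.cuda.memory_reserved(0)  # cached by the allocator, not in use
    torch.cuda.empty_cache()
    reserved_after = torch.cuda.memory_reserved(0)   # cache returned to the driver
    print(reserved_before, reserved_after)
    # nvidia-smi still reports usage on cuda:0 afterwards: that is the context.
else:
    print('CPU-only run; the caching allocator is a CUDA feature.')
```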

If I understand the map_location argument correctly when it is given as a dict, it remaps tensors from the key location to the value location, so both GPUs are initially touched (which thus also creates a context on both of them).
To avoid it you could store the state_dict on the CPU and map it to the desired GPU afterwards.

Hi, thank you for your help and patience. May I ask whether this works with DistributedDataParallel as well? The doc reads:

Besides, when loading the module, you need to provide an appropriate map_location argument to prevent a process from stepping into others' devices. If map_location is missing, torch.load will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices.

You should be able to load a CPU state_dict to each device using DDP.
However, if the original state_dict was stored on a specific GPU, you might run into the mentioned issue.
Let me know if this is the case, as I haven't seen it yet.
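For the DDP case, a hedged sketch of the usual pattern (local_rank is hard-coded here as a stand-in for the value your launcher, e.g. torchrun, provides via the environment): each process remaps the saved cuda:0 tensors onto its own device, so ranks never touch each other's GPUs.

```python
import torch

local_rank = 0  # placeholder; normally int(os.environ['LOCAL_RANK'])

# Remap tensors saved from cuda:0 onto this rank's GPU (or use 'cpu' and
# move the model afterwards, as suggested above):
map_location = {'cuda:0': f'cuda:{local_rank}'}
print(map_location)

# checkpoint = torch.load('ckpt.pt', map_location=map_location)  # 'ckpt.pt' is a placeholder
# model.load_state_dict(checkpoint['model_state_dict'])
```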