CUDA out of memory when restarting from a checkpoint

I know this is a pretty old topic. I am using model parallelism together with DDP: the model is partitioned into four parts, each residing on its own GPU. When I restart training from a checkpoint, I get a CUDA out of memory error on GPU 1 (the largest model partition). I am using AdamW as the optimizer, together with gradient checkpointing. The loading code looks like this:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# load the checkpoint onto the CPU first
checkpoint = torch.load('checkpoint.pth', map_location='cpu')
model = MyModel()
model.load_state_dict(checkpoint['model'])

optimizer = torch.optim.AdamW(model.parameters())  # same settings as during training
optimizer.load_state_dict(checkpoint['optimizer'])

model.to_device(device_list)  # custom helper: distribute the model across the 4 GPUs
model = DDP(model, device_ids=None, output_device=None)  # device_ids=None since the module spans multiple GPUs

del checkpoint
torch.cuda.empty_cache()
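
To narrow down what actually fills up GPU 1, a small check like the one below can print which device each AdamW state tensor (the per-parameter exp_avg / exp_avg_sq buffers) ends up on after load_state_dict; this is just a diagnostic sketch:

from collections import Counter

state_devices = Counter()
for param_state in optimizer.state.values():
    for value in param_state.values():
        if torch.is_tensor(value):
            state_devices[str(value.device)] += 1
print(state_devices)  # e.g. how many state tensors sit on cpu vs. cuda:1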

I have tried various fixes suggested in similar threads, but nothing has worked so far.
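
One suggestion that comes up in similar threads is to explicitly move every optimizer state tensor onto the device of the parameter it belongs to after the model has been distributed. A minimal sketch of that idea, assuming the OOM comes from the restored AdamW state not following the model partitions:

# run this after model.to_device(device_list) and optimizer.load_state_dict(...)
for param, param_state in optimizer.state.items():
    for name, value in param_state.items():
        if torch.is_tensor(value):
            param_state[name] = value.to(param.device)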