Multi-GPU torch.load() out of memory

I was training a model on 1 GPU and just figured out how to train on 2 GPUs, so I paused training and resumed after adding the multi-GPU code. At first, loading the checkpoint made torch.load() run out of memory whether I used 1 GPU or 2. Following some posts, I loaded the checkpoint onto the CPU first and deleted it after restoring the state, and the model was then able to resume on 1 GPU.

# Load the checkpoint onto the CPU so it doesn't allocate GPU memory
checkpoint = torch.load('checkpoint/ckpt.t7', map_location=torch.device("cpu"))
model.load_state_dict(checkpoint['state'])
# Free the CPU copy and release cached GPU memory
del checkpoint
torch.cuda.empty_cache()

However, when I tried to use 2 GPUs, it caused a CUDA out-of-memory error. I wasn't able to find any posts specifically about OOM with multiple GPUs, especially since everything works fine on a single GPU. Are there any tricks to loading a checkpoint when using multiple GPUs?
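For context, one checkpoint-related wrinkle I came across (I'm not sure it applies to my setup, since I haven't posted my 2-GPU code): `nn.DataParallel` prefixes every parameter name with `module.`, so a checkpoint saved on one setup may need its keys remapped before `load_state_dict()` accepts it. A minimal pure-Python sketch of that remapping (the helper names are mine, not from any library):

```python
from collections import OrderedDict

def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that nn.DataParallel adds to parameter names."""
    return OrderedDict(
        (k[len("module."):] if k.startswith("module.") else k, v)
        for k, v in state_dict.items()
    )

def add_module_prefix(state_dict):
    """Add the 'module.' prefix so a single-GPU checkpoint fits a DataParallel model."""
    return OrderedDict(
        (k if k.startswith("module.") else "module." + k, v)
        for k, v in state_dict.items()
    )

# Plain values stand in for tensors here, so this runs without a GPU
ckpt = {"module.conv1.weight": 1, "module.fc.bias": 2}
print(list(strip_module_prefix(ckpt).keys()))  # ['conv1.weight', 'fc.bias']
```

This doesn't explain the OOM by itself, but I mention it in case the memory spike comes from loading the checkpoint onto the GPUs before (or while) the model is wrapped for multi-GPU use.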