Figured it out for whoever is interested:
When training a model wrapped in nn.DataParallel, if you want to later load the state_dict into a plain model running on the CPU, save the underlying module's parameters with torch.save(model.module.state_dict(), PATH). The wrapper's own state_dict prefixes every key with "module.", which a plain model will not recognize.
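A minimal sketch of that first recipe (the model and file name are placeholders for illustration; note that nn.DataParallel can be constructed even on a CPU-only machine, which makes this easy to try):

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this is whatever you trained.
model = nn.Linear(4, 2)
dp_model = nn.DataParallel(model)

# Save the underlying module's state_dict -- keys have no "module." prefix.
torch.save(dp_model.module.state_dict(), "weights.pth")

# Later, on a CPU-only machine: load straight into a plain model.
cpu_model = nn.Linear(4, 2)
state = torch.load("weights.pth", map_location="cpu")
cpu_model.load_state_dict(state)
```

map_location="cpu" handles the case where the tensors were saved from GPU memory.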
If you instead saved the wrapper's weights with torch.save(model.state_dict(), PATH), every key carries the "module." prefix, so you must first wrap your fresh model in nn.DataParallel and only then call load_state_dict, so that the keys match.
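A sketch of the second case (again with a placeholder model). It also shows a common CPU-only alternative, not mentioned above, of stripping the "module." prefix from the keys instead of wrapping the model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
dp_model = nn.DataParallel(model)

# Saved the wrapper's state_dict: every key starts with "module."
torch.save(dp_model.state_dict(), "dp_weights.pth")

# Option A: wrap the fresh model in DataParallel first, then load.
fresh = nn.DataParallel(nn.Linear(4, 2))
fresh.load_state_dict(torch.load("dp_weights.pth", map_location="cpu"))

# Option B: strip the "module." prefix and load into a plain model.
state = torch.load("dp_weights.pth", map_location="cpu")
state = {k[len("module."):]: v for k, v in state.items()}
plain = nn.Linear(4, 2)
plain.load_state_dict(state)
```

Option B is handy when the target machine has no GPUs and you would rather not keep the DataParallel wrapper around at inference time.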