Using a single-GPU trained model in a multi-GPU setting

Hi,

I am trying to fine-tune a single-GPU trained model on multiple GPUs. First, I set
CUDA_VISIBLE_DEVICES=0,1,2. Then I wrap the model with torch.nn.DataParallel() and use the RMSprop optimizer as follows:

model = torch.nn.DataParallel(model).cuda()
optimizer = torch.optim.RMSprop(model.parameters(), lr=opt.lr, alpha=0.99,
eps=1e-8, momentum=0, weight_decay=0)

This code works well if I train a model on multiple GPUs from scratch. However, if I start from a checkpoint of a single-GPU trained model, then when execution reaches

optimizer.step()

I get the following error:

File "…/python2.7/site-packages/torch/optim/rmsprop.py", line 52, in step
state = self.state[p]
KeyError: Parameter containing:
(0, 0, ., .) = 1.00000e-02 * 2.5088
(0, 1, ., .) = 1.00000e-02 * 1.6257
.
.
.
(127, 126, ., .) = 1.00000e-02 * 2.5302
(127, 127, ., .) = 1.00000e-02 * -4.7111
[torch.cuda.FloatTensor of size 128x128x1x1 (GPU 0)]

Does anyone know what the problem is here? Thanks in advance!

What’s your code to load the model? Try doing something like this:

model = MyNetwork()
model.load_state_dict(torch.load(path_to_file))
model = torch.nn.DataParallel(model).cuda()
optimizer = torch.optim.RMSprop(model.parameters(), lr=opt.lr, alpha=0.99, eps=1e-8, momentum=0, weight_decay=0)

That is, load the weights into the plain model before wrapping it in DataParallel, and create the optimizer only after wrapping it.
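
For reference, here is a minimal sketch of that ordering that runs end to end. The MyNetwork class below is a hypothetical stand-in with a single 128x128 1x1 convolution, the learning rate is an arbitrary placeholder for opt.lr, and the "checkpoint" is written to an in-memory buffer instead of a real file. It also shows one reason the order matters: wrapping a model in DataParallel prefixes every state_dict key with "module.", so a state dict saved from a plain model matches the unwrapped model's keys, not the wrapped one's.

```python
import io

import torch
import torch.nn as nn


class MyNetwork(nn.Module):
    """Hypothetical stand-in for the real network."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(128, 128, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


# Simulate a checkpoint written by a single-GPU run (in-memory, not a real file).
single = MyNetwork()
buffer = io.BytesIO()
torch.save(single.state_dict(), buffer)
buffer.seek(0)

# 1. Build the plain model and load the weights into it.
model = MyNetwork()
model.load_state_dict(torch.load(buffer, map_location="cpu"))

# 2. Wrap it in DataParallel and move it to the GPU(s) if any are present.
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

# 3. Only now create the optimizer, so it tracks the parameters being trained.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99,
                                eps=1e-8, momentum=0, weight_decay=0)

# The wrapped model exposes the same parameters under "module."-prefixed keys.
plain_keys = sorted(single.state_dict().keys())
wrapped_keys = sorted(model.state_dict().keys())
print(plain_keys)
print(wrapped_keys)
```

If the weights have to be loaded after wrapping, model.module.load_state_dict(...) targets the inner model and avoids the key mismatch.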

Thanks for your reply! @colesbury

I tried what you suggested and it works. However, there is a related problem: I also need to restore the optimizer state saved during the previous single-GPU training rather than create a fresh optimizer.

optimizer.load_state_dict(checkpoint['optimizer'])

When I do this, the same error occurs. I guess something inside the saved optimizer state is not consistent with the multi-GPU setting. Any suggestions for solving this? Thanks.

Any update on this matter?

Try moving the model to the GPU first, then create the optimizer and load its state from the checkpoint.
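
A rough sketch of that sequence, under some assumptions: the tiny model built by make_model, the helper names, the learning rate, and the "model" checkpoint key are all hypothetical (only the "optimizer" key appears earlier in this thread), and the checkpoint is kept in memory rather than saved to disk. In recent PyTorch versions optimizer.load_state_dict() moves the restored state onto the device of the parameters the optimizer already holds, which is why the model is moved first; the older version in the traceback above may behave differently.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def make_model():
    # Hypothetical small network standing in for the real one.
    return nn.Sequential(nn.Conv2d(4, 4, kernel_size=1))


def make_optimizer(params):
    return torch.optim.RMSprop(params, lr=1e-3, alpha=0.99, eps=1e-8,
                               momentum=0, weight_decay=0)


def train_step(model, optimizer):
    x = torch.randn(2, 4, 8, 8, device=device)
    optimizer.zero_grad()
    model(x).sum().backward()
    optimizer.step()


# Earlier run on a single device: one step, then capture a checkpoint.
model = make_model().to(device)
optimizer = make_optimizer(model.parameters())
train_step(model, optimizer)
checkpoint = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# Resume: load weights, wrap, move to the device, THEN build the optimizer
# and restore its state from the checkpoint.
resumed = make_model()
resumed.load_state_dict(checkpoint["model"])
resumed = nn.DataParallel(resumed).to(device)
resumed_optimizer = make_optimizer(resumed.parameters())
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

# The restored RMSprop state is now keyed by the wrapped model's parameters.
train_step(resumed, resumed_optimizer)
```

The saved optimizer state refers to parameters by position within each parameter group, so this relies on the resumed model yielding its parameters in the same order as the model that produced the checkpoint.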