Hi,
I’m trying to implement Population Based Training on GPU. I spawn multiple processes with torch.multiprocessing, and each process trains the model with a different hyperparameter setting (the learning rate, in my example). At regular intervals the validation accuracy is computed, and each process saves its model and optimizer parameters to a shared space managed by torch.multiprocessing.Manager().dict(). Here’s the initialization code:
if __name__ == "__main__":
    try:
        set_start_method('spawn')
    except RuntimeError:
        pass

    train_state_dict = mp.Manager().dict()
    val_acc_dict = mp.Manager().dict()
    net_acc_dict = mp.Manager().dict()
    print(torch.cuda.device_count())

    learning_rate = [0.01, 0.06, 0.001, 0.008]
    processes = []
    for rank in range(4):
        p = mp.Process(target=training_cifar_multi,
                       args=(train_state_dict, val_acc_dict, net_acc_dict,
                             rank, return_top_arg, learning_rate[rank]))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
If any process’s model accuracy is not in the top 20%, it, like any normal human being, caves in to societal pressure and copies the model and optimizer parameters from one of the models in the top 20%, then tweaks the copied hyperparameters by a small amount to avoid being caught (jokes apart, that’s what it actually does: copy everything, then perturb the hyperparameters).
Here’s how model parameters are saved in any of the processes:
train_state_dict[name] = {'state_dict': model.state_dict(),
                          'optimizer': optimizer.state_dict(),
                          'epoch': epoch}
Here’s how model parameters are loaded if the model is underperforming. flag is the name of a process that is performing in the top 20%:
flag = return_top_arg(val_acc_dict, valid_accuracy)
if flag:
    model.load_state_dict(train_state_dict[flag]['state_dict'])
    optimizer.load_state_dict(train_state_dict[flag]['optimizer'])
    epoch = train_state_dict[flag]['epoch']
    for param_group in optimizer.param_groups:
        param_group['lr'] = np.random.uniform(0.5, 2) * param_group['lr']
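For context, return_top_arg does roughly the following (a simplified sketch of my actual helper, with the top fraction as a parameter for illustration): it returns the name of some process currently in the top 20% by validation accuracy, or None if the calling process is already there.

```python
import random

def return_top_arg(val_acc_dict, valid_accuracy, top_frac=0.2):
    """Simplified sketch: return the name of a process whose model is in
    the top `top_frac` by validation accuracy, or None if the caller's
    own accuracy already places it there (so no copying is needed)."""
    if len(val_acc_dict) < 2:
        return None
    ranked = sorted(val_acc_dict.items(), key=lambda kv: kv[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_frac))
    # Already in the top fraction -> keep training as-is.
    if valid_accuracy >= ranked[n_top - 1][1]:
        return None
    # Otherwise, copy from a randomly chosen top performer.
    return random.choice(ranked[:n_top])[0]
```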
However, I run into this error when the process tries to load the model:
Traceback (most recent call last):
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/usr/PBT/cifar_10.py", line 93, in training_cifar_multi
model.load_state_dict(train_state_dict[flag]['state_dict'])
File "<string>", line 2, in __getitem__
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/managers.py", line 283, in serve_client
send(msg)
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/usr/anaconda2/envs/py36/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 104, in reduce_storage
metadata = storage._share_cuda_()
RuntimeError: invalid device pointer: 0x1020ec00000 at /opt/conda/conda-bld/pytorch_1501971235237/work/pytorch-0.1.12/torch/lib/THC/THCCachingAllocator.cpp:211
I’m not sure how to debug this. Can someone please help me out here?