Hi,
I’m trying to implement Population Based Training on GPU. I spawn multiple processes with torch.multiprocessing, and each process trains the model with a different hyperparameter setting (the learning rate, in my example). At regular intervals the validation accuracy is computed, and each process saves its model and optimizer parameters to a shared space managed by torch.multiprocessing.Manager().dict(). Here’s the initialization code:
if __name__ == "__main__":
    try:
        set_start_method('spawn')
    except RuntimeError:
        pass

    train_state_dict = mp.Manager().dict()
    val_acc_dict = mp.Manager().dict()
    net_acc_dict = mp.Manager().dict()
    print(torch.cuda.device_count())

    learning_rate = [0.01, 0.06, 0.001, 0.008]
    processes = []
    for rank in range(4):
        p = mp.Process(target=training_cifar_multi,
                       args=(train_state_dict, val_acc_dict, net_acc_dict,
                             rank, return_top_arg, learning_rate[rank]))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
If any process’s model accuracy is not in the top 20%, it, like any normal human being, caves in to societal pressure and copies the model and optimizer parameters from one of the models in the top 20%, then tweaks the copied hyperparameters by a small amount to avoid being caught (jokes apart, that’s what it actually does: copy everything, then perturb the hyperparameters).
Here’s how model parameters are saved in any of the processes:
train_state_dict[name] = {'state_dict': model.state_dict(),
                          'optimizer': optimizer.state_dict(),
                          'epoch': epoch}
Here’s how model parameters are loaded if the model is underperforming. flag is the name of a process that is performing in the top 20%:
flag = return_top_arg(val_acc_dict, valid_accuracy)
if flag:
    model.load_state_dict(train_state_dict[flag]['state_dict'])
    optimizer.load_state_dict(train_state_dict[flag]['optimizer'])
    epoch = train_state_dict[flag]['epoch']
    for param_group in optimizer.param_groups:
        param_group['lr'] = np.random.uniform(0.5, 2) * param_group['lr']
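For context, return_top_arg does roughly the following (a simplified sketch of my actual helper, with the top fraction as a parameter for illustration): it returns the name of some process currently in the top 20% by validation accuracy, or None if the calling process is already there.

```python
import random

def return_top_arg(val_acc_dict, valid_accuracy, top_frac=0.2):
    """Simplified sketch: return the name of a process whose model is in
    the top `top_frac` by validation accuracy, or None if the caller's
    own accuracy already places it there (so no copying is needed)."""
    if len(val_acc_dict) < 2:
        return None
    ranked = sorted(val_acc_dict.items(), key=lambda kv: kv[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_frac))
    # Already in the top fraction -> keep training as-is.
    if valid_accuracy >= ranked[n_top - 1][1]:
        return None
    # Otherwise, copy from a randomly chosen top performer.
    return random.choice(ranked[:n_top])[0]
```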
However, I run into this error when the process tries to load the model:
Traceback (most recent call last):
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/usr/PBT/cifar_10.py", line 93, in training_cifar_multi
model.load_state_dict(train_state_dict[flag]['state_dict'])
File "<string>", line 2, in __getitem__
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/managers.py", line 772, in _callmethod
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/managers.py", line 283, in serve_client
send(msg)
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/usr/anaconda2/envs/py36/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/usr/anaconda2/envs/py36/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 104, in reduce_storage
metadata = storage._share_cuda_()
RuntimeError: invalid device pointer: 0x1020ec00000 at /opt/conda/conda-bld/pytorch_1501971235237/work/pytorch-0.1.12/torch/lib/THC/THCCachingAllocator.cpp:211
I’m not sure how to debug this. Can someone please help me out here?