I am trying to test my own AlphaZero implementation (I know hardware is going to limit me hard), and for the self-play part I would like to use all of my CPU cores to speed up the logic. I move the whole model to the CPU before entering the multiprocessing part, but I still get CUDA errors about shared memory, and I am quite unsure how to address this.
To control device usage, I set a global torch device that determines what hardware is supposed to be used (all relevant tensors are automatically sent via tensor.to(GLOBAL_DEVICE)).
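For context, the device handling looks roughly like this (simplified sketch; to_global_device is just an illustrative name for the helper I use):

    import torch

    # Global device used throughout the project: CUDA if available, else CPU.
    GLOBAL_DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    def to_global_device(tensor):
        # Every tensor that matters is routed through this helper.
        return tensor.to(GLOBAL_DEVICE)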
My code reads:
Inside the NNetWrapper class:
    class NNetWrapper:
        def __init__(self, nnet, game_dim, action_dim):
            self.nnet = nnet
            self.board_x, self.board_y = game_dim, game_dim
            self.action_size = action_dim

        def to_device(self):
            if GLOBAL_DEVICE.type == 'cpu':
                self.nnet.cpu()
            else:
                self.nnet.cuda()
            return

        ....
And in the Coach class, which does the self-play:
    curr_device = deepcopy(GLOBAL_DEVICE)
    GLOBAL_DEVICE = torch.device('cpu')
    self.nnet.to_device()
    if multiprocess:
        pbar = tqdm(total=self.num_episodes)
        pbar.set_description('Creating self-play training turns')
        with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
            game_states = []
            for _ in range(self.num_episodes):
                self.game.reset()
                game_states.append(deepcopy(self.game.state))
            futures = list(
                executor.submit(self.exec_ep,
                                mcts=MCTS(deepcopy(self.nnet), num_mcts_sims=self.mcts_sims),
                                reset_game=True,
                                state=game_states[i])
                for i in range(self.num_episodes))
            for _ in as_completed(futures):
                pbar.update(1)
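For what it's worth, this is the kind of sanity check I would run right before submitting the jobs, to convince myself that nothing is left on the GPU (just a sketch, reusing the attribute names from my wrapper above):

    # List any parameter or buffer of the network that still lives on a CUDA device.
    leftover = [name for name, p in self.nnet.nnet.named_parameters()
                if p.device.type == 'cuda']
    leftover += [name for name, b in self.nnet.nnet.named_buffers()
                 if b.device.type == 'cuda']
    print('still on CUDA:', leftover)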
However, I get the following error:
    Traceback (most recent call last):
      File "C:\ProgramData\Miniconda3\lib\multiprocessing\queues.py", line 234, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "C:\ProgramData\Miniconda3\lib\multiprocessing\reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "C:\ProgramData\Miniconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 213, in reduce_tensor
        (device, handle, storage_size_bytes, storage_offset_bytes) = storage._share_cuda_()
    RuntimeError: cuda runtime error (71) : operation not supported at c:\a\w\1\s\tmp_conda_3.6_173528\conda\conda-bld\pytorch_1549561085620\work\torch\csrc\generic\StorageSharing.cpp:232
Why am I seeing CUDA errors when I have already sent the model to the CPU? Is there something more I need to do? On a computer without CUDA this code runs just fine.
PS: I know I am supposed to use the torch.multiprocessing package, but I am rather new to multiprocessing and can't quite figure out how to replicate the ProcessPoolExecutor pipeline with it. I'd be happy about tips to that end as well.
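In case it helps to see what I have in mind, this is roughly what I imagine the torch.multiprocessing equivalent would look like (untested sketch; I pulled the episode function out of the class just to make it picklable, and I am not sure that is the right approach):

    import torch.multiprocessing as mp

    def run_episode(args):
        # Each worker builds its own MCTS around the (CPU) network copy it receives.
        nnet, state, mcts_sims = args
        mcts = MCTS(nnet, num_mcts_sims=mcts_sims)
        return exec_ep(mcts=mcts, reset_game=True, state=state)

    if __name__ == '__main__':
        mp.set_start_method('spawn', force=True)  # Windows only supports 'spawn' anyway
        with mp.Pool(processes=mp.cpu_count()) as pool:
            jobs = [(nnet, state, mcts_sims) for state in game_states]
            for _ in pool.imap_unordered(run_episode, jobs):
                pbar.update(1)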
Thanks in advance,