Hello,
I am trying to test out my own AlphaZero implementation (I know hardware will limit me hard), and for the self-play part I would like to use all of my CPU cores to accelerate the logic. I am moving the whole model to the CPU before entering the multiprocessing part, yet I still get CUDA errors about shared memory, and I am quite unsure how to address this.
To control device usage, I set a global torch device that determines where computation runs (all relevant tensors are automatically sent via tensor.to(GLOBAL_DEVICE)).
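For context, the pattern looks roughly like this (simplified; `to_global_device` is just an illustrative helper, my real code calls `.to(GLOBAL_DEVICE)` inline):

```python
import torch

# Global device switch: everything device-related goes through this one variable.
GLOBAL_DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def to_global_device(t):
    # Every relevant tensor is routed through the global device before use.
    return t.to(GLOBAL_DEVICE)
```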
My code reads:
Inside the NNetWrapper class:
class NNetWrapper:
    def __init__(self, nnet, game_dim, action_dim):
        self.nnet = nnet
        self.board_x, self.board_y = game_dim, game_dim
        self.action_size = action_dim

    def to_device(self):
        if GLOBAL_DEVICE.type == 'cpu':
            self.nnet.cpu()
        else:
            self.nnet.cuda()
        return

    ...
And in the Coach class, which does the self-play:
curr_device = deepcopy(GLOBAL_DEVICE)
GLOBAL_DEVICE = torch.device('cpu')
self.nnet.to_device()
if multiprocess:
    pbar = tqdm(total=self.num_episodes)
    pbar.set_description('Creating self-play training turns')
    with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
        game_states = []
        for _ in range(self.num_episodes):
            self.game.reset()
            game_states.append(deepcopy(self.game.state))
        futures = [executor.submit(self.exec_ep,
                                   mcts=MCTS(deepcopy(self.nnet),
                                             num_mcts_sims=self.mcts_sims),
                                   reset_game=True,
                                   state=game_states[i])
                   for i in range(self.num_episodes)]
        for _ in as_completed(futures):
            pbar.update(1)
However, I get the following error:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\lib\multiprocessing\queues.py", line 234, in _feed
obj = _ForkingPickler.dumps(obj)
File "C:\ProgramData\Miniconda3\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "C:\ProgramData\Miniconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 213, in reduce_tensor
(device, handle, storage_size_bytes, storage_offset_bytes) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at c:\a\w\1\s\tmp_conda_3.6_173528\conda\conda-bld\pytorch_1549561085620\work\torch\csrc\generic\StorageSharing.cpp:232
Why am I seeing CUDA errors when I have already moved the model to the CPU beforehand? Is there something more I need to do? On a machine without CUDA this code runs just fine.
PS: I know I am supposed to use the torch.multiprocessing package, but I am rather new to multiprocessing and can't quite figure out how to replicate the ProcessPoolExecutor pipeline with it. I'd be happy about tips to that end as well.
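To illustrate what I mean, here is a stripped-down toy of what I think the torch.multiprocessing version would look like (untested on my real code; `exec_ep` and the `nn.Linear` net are trivial stand-ins for my actual episode function and network):

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def exec_ep(task):
    # Stand-in for one self-play episode: just evaluate the (CPU) net on a state.
    nnet, state = task
    with torch.no_grad():
        return nnet(state).sum().item()

if __name__ == "__main__":
    nnet = nn.Linear(4, 2)   # toy stand-in for the real network
    nnet.cpu()               # make sure no parameter still lives on the GPU
    nnet.share_memory()      # CPU params go to shared memory, so workers don't copy them
    states = [torch.randn(4) for _ in range(8)]
    # mp.Pool replaces ProcessPoolExecutor; pool.map replaces submit/as_completed.
    with mp.Pool(processes=2) as pool:
        results = pool.map(exec_ep, [(nnet, s) for s in states])
    print(len(results))
```

Is something along these lines the intended way, or does the model have to be created inside each worker instead of being pickled across?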
Thanks in advance,
Michael