Multiprocessing with a CUDA model

Hello,

I am trying to test out my own AlphaZero implementation (I know the hardware is going to limit me hard), and for the self-play part I would like to use all of my CPU cores to speed up the logic. I move the whole model to the CPU before entering the multiprocessing section; however, I still get CUDA errors about shared memory, and I am quite unsure how to address this.

To control device usage, I set a global torch device that determines which device is supposed to be used; all relevant tensors are sent there automatically via tensor.to(GLOBAL_DEVICE).
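For context, the device handling looks roughly like this (simplified; to_global_device is just a stand-in for the helper I actually use):

import torch

# Single global switch; everything that creates tensors reads this.
GLOBAL_DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def to_global_device(tensor):
    # Stand-in for the helper that routes tensors to the currently active device.
    return tensor.to(GLOBAL_DEVICE)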

My code reads:

Inside the NNetWrapper class:

class NNetWrapper:
    def __init__(self, nnet, game_dim, action_dim):
        self.nnet = nnet
        self.board_x, self.board_y = game_dim, game_dim
        self.action_size = action_dim

    def to_device(self):
        # Move the wrapped network to whatever GLOBAL_DEVICE currently points to.
        if GLOBAL_DEVICE.type == 'cpu':
            self.nnet.cpu()
        else:
            self.nnet.cuda()
        return
....

And in the coach class, which does the self-play:

                # Remember the current device and switch everything to the CPU for self-play.
                curr_device = deepcopy(GLOBAL_DEVICE)
                GLOBAL_DEVICE = torch.device('cpu')
                self.nnet.to_device()

                if multiprocess:
                    pbar = tqdm(total=self.num_episodes)
                    pbar.set_description('Creating self-play training turns')
                    with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
                        # Pre-generate one fresh game state per episode.
                        game_states = []
                        for _ in range(self.num_episodes):
                            self.game.reset()
                            game_states.append(deepcopy(self.game.state))
                        # Each episode gets its own MCTS built around a copy of the network.
                        futures = [
                            executor.submit(self.exec_ep,
                                            mcts=MCTS(deepcopy(self.nnet), num_mcts_sims=self.mcts_sims),
                                            reset_game=True,
                                            state=game_states[i])
                            for i in range(self.num_episodes)]
                        for _ in as_completed(futures):
                            pbar.update(1)

However, I get the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\lib\multiprocessing\queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:\ProgramData\Miniconda3\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 213, in reduce_tensor
    (device, handle, storage_size_bytes, storage_offset_bytes) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at c:\a\w\1\s\tmp_conda_3.6_173528\conda\conda-bld\pytorch_1549561085620\work\torch\csrc\generic\StorageSharing.cpp:232

Why am I still seeing CUDA errors when I have already sent the model to the CPU beforehand? Is there something more I need to do? On a machine without CUDA, this code runs just fine.
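For debugging: is something like the check below enough to verify that nothing in the network is still on a CUDA device? (check_all_cpu is just a throwaway helper sketched for this post, not part of my pipeline.)

def check_all_cpu(module):
    # Throwaway helper: list every parameter or buffer that is not on the CPU.
    tensors = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, t in tensors if t.device.type != 'cpu']

# e.g. check_all_cpu(self.nnet.nnet) -- an empty list should mean everything lives on the CPU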

PS: I know I am supposed to use the torch.multiprocessing package; however, I am rather new to multiprocessing and can't quite figure out how to replicate the ProcessPoolExecutor pipeline with it. I'd be happy about tips to that end as well.
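Something like the sketch below is what I had in mind as a torch.multiprocessing equivalent (using a spawn context, which the docs recommend when CUDA is involved); run_episode and self_play_pool are hypothetical names, and MCTS/exec_ep are the same objects as in my code above. I have no idea whether this is the right way to mirror the executor:

from copy import deepcopy
import torch.multiprocessing as mp

def run_episode(args):
    # Hypothetical top-level wrapper so the pool can pickle the call.
    coach, state = args
    mcts = MCTS(deepcopy(coach.nnet), num_mcts_sims=coach.mcts_sims)
    return coach.exec_ep(mcts=mcts, reset_game=True, state=state)

def self_play_pool(coach, game_states):
    # Rough torch.multiprocessing counterpart of the ProcessPoolExecutor loop above.
    ctx = mp.get_context('spawn')  # 'spawn' rather than 'fork'
    with ctx.Pool(processes=mp.cpu_count()) as pool:
        results = []
        jobs = [(coach, state) for state in game_states]
        for res in pool.imap_unordered(run_episode, jobs):
            results.append(res)  # a tqdm bar could be updated here
    return results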

Thanks in advance,
Michael