Multiprocessing with a CUDA model


I am trying to test out my own AlphaZero implementation (I know hardware is going to limit me hard), and for the self-play part I would like to use all of my CPU cores to speed up the game logic. I move the whole model to the CPU before entering the multiprocessing section, but I still get CUDA errors about shared memory, and I am unsure how to address this.

To control device usage, I set a global torch device that determines which device is supposed to be used (automatically sends all relevant …

My code reads:

Inside the NNetWrapper class:

class NNetWrapper:
    def __init__(self, nnet, game_dim, action_dim):
        self.nnet = nnet
        self.board_x, self.board_y = game_dim, game_dim
        self.action_size = action_dim

    def to_device(self):
        if GLOBAL_DEVICE.type == 'cpu':

And in the coach class, which does the self-play:

                curr_device = deepcopy(GLOBAL_DEVICE)
                GLOBAL_DEVICE = torch.device('cpu')

                if multiprocess:
                    pbar = tqdm(total=self.num_episodes)
                    pbar.set_description('Creating self-play training turns')
                    with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
                        game_states = []
                        for _ in range(self.num_episodes):
                        futures = list(
                                             mcts=MCTS(deepcopy(self.nnet), num_mcts_sims=self.mcts_sims),
                             for i in range(self.num_episodes)))
                        for _ in as_completed(futures):

However I get the following errors:

Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\lib\multiprocessing\", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:\ProgramData\Miniconda3\lib\multiprocessing\", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:\ProgramData\Miniconda3\lib\site-packages\torch\multiprocessing\", line 213, in reduce_tensor
    (device, handle, storage_size_bytes, storage_offset_bytes) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at c:\a\w\1\s\tmp_conda_3.6_173528\conda\conda-bld\pytorch_1549561085620\work\torch\csrc\generic\StorageSharing.cpp:232
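
The traceback fails inside `_ForkingPickler.dumps`, i.e. while the arguments are being pickled for the worker process, so one of the submitted objects presumably still holds a CUDA tensor. A minimal sketch for narrowing this down in the parent process before submitting anything; `find_unpicklable` is a hypothetical helper name, and the lambda only stands in for an object that cannot be pickled (I am assuming a tensor still on the GPU would show up the same way, which I have not confirmed):

```python
from multiprocessing.reduction import ForkingPickler


def find_unpicklable(**kwargs):
    """Return {argument name: error message} for arguments that fail to pickle."""
    failures = {}
    for name, value in kwargs.items():
        try:
            # Same pickler that multiprocessing queues use to ship objects to workers.
            ForkingPickler.dumps(value)
        except Exception as exc:  # broad on purpose: report every kind of failure
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Illustrative stand-ins: an int pickles fine, a lambda does not.
report = find_unpicklable(episodes=25, callback=lambda x: x)
print(report)
```

In my case the arguments to check would be whatever is passed to the executor, for example the `MCTS` object built from `deepcopy(self.nnet)`.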

Why am I seeing CUDA errors when I have already moved the model to the CPU beforehand? Is there something more I need to do? On a computer without CUDA this code runs just fine.

PS: I know I am supposed to use the torch.multiprocessing package, but I am rather new to multiprocessing and can't quite figure out how to replicate the ProcessPoolExecutor pipeline with it. I'd be happy about tips to that end as well.
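
For context, the executor pattern I want to replicate boils down to the sketch below. It uses only the standard library; `play_episode` and `run_self_play` are hypothetical stand-ins for my self-play code, not the real methods. The `mp_context` argument of `ProcessPoolExecutor` needs Python 3.7 or newer. torch.multiprocessing is documented as a drop-in replacement for the standard `multiprocessing` module, so passing a context obtained from it might be the way to go, but that part is an assumption I have not verified.

```python
import multiprocessing
import random
from concurrent.futures import ProcessPoolExecutor, as_completed


def play_episode(seed):
    """Hypothetical stand-in for one self-play episode; returns a list of turns."""
    rng = random.Random(seed)
    return [rng.randint(0, 8) for _ in range(5)]


def run_self_play(num_episodes, max_workers=2):
    """Run episodes in worker processes and collect the turns as they finish."""
    # "spawn" is the only start method available on Windows.
    ctx = multiprocessing.get_context("spawn")
    turns = []
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as executor:
        futures = [executor.submit(play_episode, seed) for seed in range(num_episodes)]
        for future in as_completed(futures):
            turns.extend(future.result())
    return turns
```

With the spawn start method the call has to sit behind a main guard in a script file, for example `if __name__ == "__main__": print(len(run_self_play(num_episodes=4)))`.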

Thanks in advance,