CUDA invalid device pointer

dylandjian · April 8, 2018, 2:57pm

Hi!

While doing some multiprocessing, I ran into this error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 241, in _feed
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/dylan/.virtualenvs/testGo/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 104, in reduce_storage
    metadata = storage._share_cuda_()
RuntimeError: invalid device pointer: 0x204bc6a00 at /home/dylan/Desktop/superGo/pytorch/aten/src/THC/THCCachingAllocator.cpp:259

The context of this error is the following:
I launch my training in a separate process (this works fine). That process initializes the model (player) and trains a deepcopy of it (new_player). At some point during training, I want to asynchronously launch another process to evaluate the two models against each other, like this:

pool = MyPool(1)
pool.apply_async(evaluate, args=(player, new_player,), callback=new_agent)

(MyPool is a subclass of multiprocessing.pool.Pool whose worker processes always have daemon set to False.)
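For readers unfamiliar with the pattern: daemonic processes are not allowed to create children, so a pool whose workers need to spawn further processes has to use non-daemonic workers. A minimal sketch of such a pool is given here. It is not necessarily the MyPool used in this post; the names NoDaemonProcess and NoDaemonContext are illustrative, and it assumes a Python 3 version in which Pool accepts a context argument.

```python
import multiprocessing
import multiprocessing.pool


class NoDaemonProcess(multiprocessing.Process):
    """Process whose 'daemon' flag always reads False."""

    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        # Pool tries to mark its workers as daemonic; silently ignore that.
        pass


class NoDaemonContext(type(multiprocessing.get_context())):
    """The default multiprocessing context, but creating NoDaemonProcess workers."""
    Process = NoDaemonProcess


class MyPool(multiprocessing.pool.Pool):
    """Pool whose workers are non-daemonic and may start child processes."""

    def __init__(self, *args, **kwargs):
        kwargs["context"] = NoDaemonContext()
        super().__init__(*args, **kwargs)
```

A pool built this way is used like a regular one, e.g. MyPool(1) followed by apply_async as in the snippet above. Note that this only changes the daemon flag; it does not change how CUDA tensors are shared between processes.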

It seems that the parameters of the models can’t be copied to the new process for some reason.
Any idea how to fix this?

I’m using Python 3.5.2 and PyTorch 0.4.0a0+d93d41b, built from source.
Thanks!

dylandjian · April 9, 2018, 9:09am

I managed to reproduce the error with the following code:


https://gist.github.com/dylandjian/05d872c6d3d74e80c04bb70187090c3e

test.py

import multiprocessing
import multiprocessing.pool
import torch


class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):

(The file is truncated here; the full source is in the gist linked above.)

Also, I didn’t notice this at first, but it looks like the copy is correctly sent to the second new_process and only fails to get copied to the third one?

acgtyrant · August 8, 2018, 3:21am

Has this been solved?