"No space left on device" with DistributedDataParallel despite space left

dstrohmaier · July 21, 2023, 10:27am

Python=3.11.4 | packaged by conda-forge
pytorch=2.0.1
pytorch-cuda=11.7

I use DistributedDataParallel for training and testing a model and am getting the following error:

Traceback (most recent call last):
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/local/scratch/ds858/nlsa/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
    metadata = storage._share_filename_cpu_()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A similar error has been reported here: Unable to write to file </torch_18692_1954506624>
I am, however, not using Docker or anything like it. The shared memory should also be more than sufficient. It has been checked repeatedly.
The num_workers of the dataloader should also be 0, as it is left at its default value. (There are, however, multiple processes running on the GPUs at the same time.)

A brief look at reductions.py showed that the failing line in the traceback above (the call to storage._share_filename_cpu_()) is specific to the file_system sharing strategy. As a test, I switched to the file_descriptor strategy. This led to a different error, shown below.
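For reference, a minimal sketch of how the sharing strategy can be inspected and switched through torch.multiprocessing. It assumes the call is made once at program start, before any worker processes or queues are created; where exactly it was placed in the original script is not shown in this thread.

```python
import torch.multiprocessing as mp

# Strategies supported on this platform (on Linux this normally
# includes both "file_descriptor" and "file_system").
print("available:", sorted(mp.get_all_sharing_strategies()))
print("current:  ", mp.get_sharing_strategy())

# Switch strategies before spawning processes or creating queues,
# so that every tensor sent between processes uses the same mechanism.
mp.set_sharing_strategy("file_descriptor")
print("now:      ", mp.get_sharing_strategy())
```

Note that the file_descriptor strategy keeps file descriptors open for shared tensors, so it can run into the per-process open-file limit on some systems.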

  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 307, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 507, in Client
    answer_challenge(c, authkey)
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 756, in answer_challenge
    response = connection.recv_bytes(256)        # reject large message
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 215, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/local/scratch/ds858/nlsa/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
    raise EOFError
EOFError

I am grateful for any help or advice for addressing this issue.

kumpera · July 21, 2023, 2:31pm

Can you share the full error message and a repro script?

From your description, it might be that your system has too little shared memory (shm) configured. You can check the value of kernel.shmmax with /sbin/sysctl -a.
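A rough sketch of how these limits could be inspected from Python. As far as I understand, kernel.shmmax caps System V shared memory segments, while the file_system strategy creates POSIX shared memory files that usually live on the tmpfs mounted at /dev/shm, so it may be worth looking at both. "No space left on device" can also mean the filesystem ran out of inodes rather than bytes. The helper names shm_report and read_shmmax are made up for this example, and the /dev/shm and /proc paths are assumptions about a typical Linux setup.

```python
import os
from pathlib import Path


def shm_report(path="/dev/shm"):
    """Return free/total bytes and inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


def read_shmmax(proc_file="/proc/sys/kernel/shmmax"):
    """Return kernel.shmmax as an int, or None if the proc file is absent."""
    p = Path(proc_file)
    return int(p.read_text()) if p.exists() else None


if __name__ == "__main__":
    # Fall back to "/" on systems without /dev/shm so the script still runs.
    target = "/dev/shm" if os.path.isdir("/dev/shm") else "/"
    print(target, shm_report(target))
    print("kernel.shmmax:", read_shmmax())
```

Since the error is intermittent, running this while the job is active (rather than before it starts) is more likely to show whether the space or inode count gets close to zero.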

dstrohmaier · July 21, 2023, 3:11pm

I cannot share the whole code at the moment. Generally, though, I am using Hugging Face BERT models. The error occurred when I was creating embeddings with the model and passing them back to the main process through a queue. I am not even training the model at this point: it is set to eval() and I am using a no_grad() context. The error has also been intermittent.

We have checked the value of kernel.shmmax and it should far exceed what my process could possibly ever use. Specifically, kernel.shmmax is 18,446,744,073,692,774,399, i.e. 0xFFFF FFFF FEFF FFFF = 2^64 - 2^24 - 1, which we assume is the default value for a 64-bit machine.
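The arithmetic for that value can be double-checked in a few lines of Python. The variable name shmmax is just a label for the number quoted above.

```python
# The kernel.shmmax value reported above.
shmmax = 18_446_744_073_692_774_399

# Same number written in hex and as a power-of-two expression.
assert shmmax == 0xFFFF_FFFF_FEFF_FFFF
assert shmmax == 2**64 - 2**24 - 1

# Expressed in EiB (2**60 bytes) to show how large the limit is.
print(f"{shmmax / 2**60:.2f} EiB")  # prints "16.00 EiB"
```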

I have not seen the same error on another machine I am using, however, so it might be some other bit of machine-specific configuration.