I keep getting this issue when running DDP in PyTorch:
```
Traceback (most recent call last):
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 330, in <module>
    main_distributed()
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 230, in main_distributed
    mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 1 terminated with signal SIGSEGV
```
but this error is rather uninformative (it doesn't tell me which process failed or what it was trying to access, for example), so I am unsure what I need to do to solve it.
Some research tells you that:

> SIGSEGV: On a Unix operating system such as Linux, a “segmentation violation” (also known as “signal 11”, “SIGSEGV”, “segmentation fault” or, abbreviated, “sig11” or “segfault”) is a signal sent by the kernel to a process when the system has detected that the process was attempting to access a memory address that does not belong to it. Typically, this results in the offending process being terminated.
Yes, I do have multiprocessing code, as the usual mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size) call requires.
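For context, the spawn + process-group setup I am using looks roughly like the sketch below (the toy model, the `SimpleNamespace` opts, and the master port are placeholders, not my actual code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_process(rank, world_size, backend='nccl'):
    # every spawned process joins the same process group before wrapping the model in DDP
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'  # placeholder port
    dist.init_process_group(backend, rank=rank, world_size=world_size)

def train(rank, opts):
    setup_process(rank, opts.world_size)
    model = torch.nn.Linear(4, 4).to(rank)      # toy model standing in for my real one
    ddp_model = DDP(model, device_ids=[rank])
    # ... forward/backward/optimizer steps would go here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    from types import SimpleNamespace
    opts = SimpleNamespace(world_size=2)        # stand-in for my real opts object
    # mp.spawn passes the rank as the first positional argument to train
    mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
```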
First I read the docs on sharing strategies, which talk about how tensors are shared in PyTorch:

> Note that it applies only to CPU tensor - CUDA tensors will always use the CUDA API, as that’s the only way they can be shared.

I was using the file-system sharing strategy since it seemed to give me fewer issues when I needed lots of processes, but I went down to only 2 processes and 2 GPUs and switched the sharing strategy to file descriptor. I thought that perhaps if the processes had their own cached file descriptors then there wouldn't be issues.
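For what it's worth, switching strategies is just this one call (a minimal sketch; the prints are only there to confirm what the platform supports):

```python
import torch.multiprocessing as mp

# see which strategies this platform supports and which one is currently active
print(mp.get_all_sharing_strategies())   # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())

# what I had before:
# mp.set_sharing_strategy('file_system')
# what I switched to:
mp.set_sharing_strategy('file_descriptor')
```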
I did check the CUDA devices available:

```
$ echo $CUDA_VISIBLE_DEVICES
1,3
```

All seems fine.
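I also double-checked from Python that torch sees the same devices (a quick sanity-check snippet, not part of my training code):

```python
import os
import torch

# with CUDA_VISIBLE_DEVICES=1,3 torch should report exactly two visible devices
print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # '1,3'
print(torch.cuda.is_available())               # True
print(torch.cuda.device_count())               # 2
```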
I am unsure what might be causing the issue. There are possible causes like:
- two processes trying to checkpoint at the same time, but I always only let rank=0 do the checkpointing (the guard is sketched below), so that doesn't make sense;
- two processes writing to tensorboard, but I also only allow rank=0 to do the logging (or any of the printing).
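The guard I mean is essentially this pattern (a sketch; save_ckpt and the path are hypothetical names, and the barrier is optional depending on whether the other ranks need to wait):

```python
import torch
import torch.distributed as dist

def is_lead_rank(rank: int) -> bool:
    # only rank 0 is allowed to touch the filesystem / tensorboard / stdout
    return rank == 0

def save_ckpt(rank: int, model, path: str = 'checkpoint.pt'):  # hypothetical helper
    if is_lead_rank(rank):
        torch.save(model.state_dict(), path)
    dist.barrier()  # keep other ranks from racing ahead while rank 0 writes
```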
So I am unsure what could be causing the issue. It could be that I have my dataset concatenated into 1 single json file, but that wasn't causing issues yesterday with multiple GPUs... and if that were the case it would be hard to fix, since DDP (distributed data parallel) uses the DistributedSampler, which doesn't place any restriction like that on my dataset or dataloaders, or at least as far as I know (afaik).
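To be concrete, my understanding is that the sampler only partitions indices across ranks, regardless of how the underlying data is stored; roughly (with a hypothetical dataset class standing in for mine):

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class JsonListDataset(Dataset):
    """Hypothetical dataset holding examples loaded from one big json file."""
    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

def make_loader(dataset, rank, world_size, batch_size=32):
    # the sampler splits indices across ranks; it never looks at how the data is stored
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```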
Last thing is that yesterday I was getting a weird error too, and somehow it occurred to me to check the GPU type. I was getting an issue because I was using a K40 GPU; I made sure that was not the case. Yesterday I was using a Quadro 6000 RTX, but today it seems these are the GPUs I got:
```
$ nvidia-smi
Tue Mar 2 12:15:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 22% 37C P0 56W / 250W | 0MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 24% 39C P0 56W / 250W | 0MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:82:00.0 Off | N/A |
| 53% 84C P2 244W / 250W | 11935MiB / 12196MiB | 57% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:83:00.0 Off | N/A |
| 25% 39C P0 56W / 250W | 0MiB / 12196MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 31809 C python 11933MiB |
+-----------------------------------------------------------------------------+
```
Not sure if that is causing the issue, but it's not always realistic to get the Quadros, so I want it to work for the Titans too (and anything that isn't the K40s, since the K40s seem to not be supported by PyTorch anymore).
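In case it matters, this is how I check the GPU model / compute capability from inside a run (a small sanity-check snippet; the capability numbers in the comments are just for reference):

```python
import torch

# print what each visible device actually is, plus its compute capability
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f'device {i}: {name}, compute capability {major}.{minor}')
# a K40 reports 3.5, which newer PyTorch binaries reportedly no longer support,
# while a TITAN X (Pascal) reports 6.1
```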
There are a few PyTorch discussion forum posts and GitHub issues, but none seems very helpful (to me at least - it's not clear what they did to solve things even by the end of the discussion):
- Multiprocessing using torch.multiprocessing - #3 by Brando_Miranda
- Using torch.Tensor over multiprocessing.Queue + Process fails - #12 by eacousineau
- Process 3 terminated with signal SIGSEGV · Issue #1720 · facebookresearch/fairseq · GitHub
crossposted: