How to fix a SIGSEGV in pytorch when using distributed training (e.g. DDP)?

I keep getting this issue when running DDP in pytorch:

Traceback (most recent call last):
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 330, in <module>
    main_distributed()
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 230, in main_distributed
    mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 1 terminated with signal SIGSEGV

but this error is rather uninformative (it doesn't tell me which process failed or what memory it was trying to access, for example), so I am unsure what I need to do to solve it.

Some research turns up the following:

SIGSEGV: On a Unix operating system such as Linux, a “segmentation violation” (also known as “signal 11”, “SIGSEGV”, “segmentation fault” or, abbreviated, “sig11” or “segfault”) is a signal sent by the kernel to a process when the system has detected that the process was attempting to access a memory address that does not belong to it. Typically, this results in the offending process being terminated.

Yes, I do have multiprocessing code, as the usual mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size) requires.

First I read the docs on sharing strategies, which talk about how tensors are shared in pytorch:

Note that it applies only to CPU tensor - CUDA tensors will always use the CUDA API, as that’s the only way they can be shared.

I was using the file_system sharing strategy since it seemed to give me fewer issues when I needed lots of processes, but I went down to only 2 processes and 2 GPUs and switched the sharing strategy back to file_descriptor. I thought that perhaps if each process had its own cached file descriptors then there wouldn't be issues.
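
For reference, the strategy switch itself is a single call; a minimal sketch of what I mean (the strategy names are the ones the PyTorch docs list, everything else is just illustrative):

import torch.multiprocessing as mp

# On Linux the available strategies are typically {'file_descriptor', 'file_system'}.
print(mp.get_all_sharing_strategies())

# Switch back to the default file_descriptor strategy (what I tried with 2 processes).
mp.set_sharing_strategy('file_descriptor')
print(mp.get_sharing_strategy())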

I did check the CUDA devices available:

 $ echo $CUDA_VISIBLE_DEVICES
1,3

All seems fine.
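
A quick sanity check from inside python gives the same picture (just a sketch of what I ran, nothing special); with CUDA_VISIBLE_DEVICES=1,3 PyTorch should see exactly two devices, renumbered as cuda:0 and cuda:1:

import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())  # expect 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))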

I am unsure what might be causing the issue. There are possible culprits like:

  • two processes trying to checkpoint at the same time, but I always let only rank=0 do the checkpointing, so that doesn't make sense.
  • two processes writing to tensorboard, but I also allow only rank=0 to do the logging (or any of the printing); a minimal sketch of that rank-0 guard is right after this list.
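
Roughly, the guard for both of those looks like this (a simplified sketch; is_lead_worker is a made-up helper name here, not necessarily what my actual script calls it):

import torch

def is_lead_worker(rank: int) -> bool:
    # only rank 0 touches the filesystem / tensorboard / stdout
    return rank == 0

def maybe_checkpoint(rank: int, model, path: str) -> None:
    if not is_lead_worker(rank):
        return
    # checkpointing (and any tensorboard logging / printing) only happens on rank 0
    torch.save(model.state_dict(), path)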

So I am unsure what could be causing the issue. It could be that having my dataset concatenated into one single json file is the problem, but that wasn't causing issues yesterday with multiple GPUs… though if that is the case it would be hard to fix, since DDP (distributed data parallel) uses the DistributedSampler, which doesn't place any restriction like that on my dataset or dataloaders… or at least as far as I know (afaik).

Last thing: yesterday I was getting a weird error too, and somehow it occurred to me to check the GPU type. I was getting an issue because I was using a K40 GPU. I made sure that was not the case this time. Yesterday I was using a Quadro RTX 6000; today it seems these are the GPUs I got:

$ nvidia-smi
Tue Mar  2 12:15:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:02:00.0 Off |                  N/A |
| 22%   37C    P0    56W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:03:00.0 Off |                  N/A |
| 24%   39C    P0    56W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 00000000:82:00.0 Off |                  N/A |
| 53%   84C    P2   244W / 250W |  11935MiB / 12196MiB |     57%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 00000000:83:00.0 Off |                  N/A |
| 25%   39C    P0    56W / 250W |      0MiB / 12196MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    2   N/A  N/A     31809      C   python                          11933MiB |
+-----------------------------------------------------------------------------+

Not sure if that is causing the issue, but it's not always realistic to get the Quadros, so I want it to work for the Titans too (and anything that isn't the K40s, since the K40s seem to not be supported by pytorch anymore).
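
If you want to check programmatically whether the card you land on is still covered by your PyTorch binary, something along these lines should work (a sketch; torch.cuda.get_arch_list() is only available in fairly recent PyTorch versions):

import torch

# The K40 is compute capability 3.5 (sm_35), which the prebuilt wheels stopped targeting.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")

# Architectures the installed binary was actually compiled for.
print(torch.cuda.get_arch_list())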

There are a few pytorch discussion forum posts and GitHub issues, but none seems very helpful (to me at least - it's not clear what they did to solve things, even though the discussions ended):


crossposted:


Hey @Brando_Miranda,

I have a very similar, if not the same, issue (difficult to say). Have you found a solution to this problem? In my case the issue also occurs rather infrequently: running on the same server (same GPUs, environment, etc.), training my model sometimes succeeds and sometimes ends with SIGSEGV.

Cheers

Edit: If it is of any help, I posted my code here.


Hi Dsethcz. I have not been able to solve the weird memory errors. However, I noticed that it happened only at the end of my training script (i.e. once destroying the dist group happened). So my model in theory seems to have trained to the end, and the issue is something in shutting down the distributed setup. From skimming your posts it seems you're having the same behaviour. Did you try just checkpointing your model (or doing whatever else you had to do) up to the point where you have to destroy the dist group?

For me, I just collect all the processes with some wait call and then allow them to crash at the very end (it's always the same error). I think it must be something with my server, because on the DGX I have access to I don't think it has ever happened. I recommend ignoring it if you fall into my scenario.
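
Concretely, the "wait call" I mean is just a barrier right before tearing things down, roughly like this (sketch):

import torch.distributed as dist

def cleanup():
    # make sure every rank has finished its work (checkpointing, logging, etc.)
    dist.barrier()
    # when the SIGSEGV shows up for me, it is around this teardown
    dist.destroy_process_group()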

best of luck!

Hmm that is not an optimal solution…

Have you tested whether the model trains correctly until the crash occurs (for example by comparing to single-GPU or DataParallel training)?

Single GPU works fine. The DGX machine works fine. I can't see a pattern in which GPUs crash on me.

I don’t use DataParallel so no.

Yeah, I know it's suboptimal, but sometimes, due to the law of diminishing returns, the last tiny gain (which is just that my script doesn't print an error) isn't worth the days/weeks of effort I've already put into solving it. It's also not really my job to fix pytorch not working. So I am happy with my current solution, as long as it trains to completion as it does now.

Not really a solution, but a workaround that seems to work: I downgraded from python 3.9.1 to 3.8.7. A more detailed report can be found here.

I also played around with versions of things. Python 3.8, pytorch 1.7.1, and cuda 10.2 (but with a driver of at least 11.0) seemed to work consistently for me.

Glad your python version change worked.

I had the same problem and solution: I downgraded python to 3.8.7 and PyTorch to 1.7. I was using torchrun and DDP in PyTorch 1.10, but torchrun doesn't work with PyTorch 1.7, so I had to stop using torchrun and use torch.distributed.launch instead. Now it works smoothly with no SIGSEGV errors.
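
For anyone else making the same switch, the change is only in how the script is launched; something like this (train.py is just a placeholder name):

$ # with PyTorch >= 1.10 I launched via: torchrun --nproc_per_node=2 train.py
$ # with PyTorch 1.7 I use the older launcher instead:
$ python -m torch.distributed.launch --nproc_per_node=2 train.py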

This worked for me

Apparently you need to downgrade your python version to 3.7 and your pytorch version to 1.7.1. If you are using cuda 11.0, then upgrade to 1.7.1+cu110.
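
In case it helps, the install line for that combination is along these lines (from memory, so double-check the exact versions against the official previous-versions page):

$ pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html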

Replacing mp.spawn with the start and join methods can solve this problem. The issue is likely caused by a faulty implementation of spawn in PyTorch, which leads to incorrect mapping of shared memory between processes. Using start and join avoids this problem and prevents segmentation faults.

# mp.spawn(run, args=(world_size, q), nprocs=world_size, join=True)

# Launch one worker process per rank by hand; unlike mp.spawn, the rank `i`
# has to be passed explicitly as the first argument of `run`.
children = []
for i in range(world_size):
    subproc = mp.Process(target=run, args=(i, world_size, q))
    children.append(subproc)
    subproc.start()

# Wait for every worker to finish.
for i in range(world_size):
    children[i].join()

With modifications made like this, the code should work properly.


Even with this modification my training didn’t start. See my code below:

import os

import torch
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from torch.distributed import init_process_group, destroy_process_group
from transformers import AutoModelForUniversalSegmentation

def ddp_setup(rank: int, world_size: int):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "3000"

    init_process_group(backend="nccl", rank=rank, world_size=world_size)

def prepare_dataloader(
    trainset: Dataset,
    bs: int,
) -> DataLoader:
    sampler_train = DistributedSampler(trainset)

    trainloader = DataLoader(
        trainset,
        batch_size=bs,
        pin_memory=True,
        shuffle=False,
        sampler=sampler_train
    )
    return trainloader

class Trainer:
    def __init__(
        self,
        model: AutoModelForUniversalSegmentation,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        gpu_id: int,
        save_every: int
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.model = DDP(self.model, device_ids=[self.gpu_id])

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])
        print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz}")

        for i, batch in enumerate(self.train_data):
            # zero the parameter gradients
            self.optimizer.zero_grad()

            batch = {k: v.to(self.gpu_id) for k, v in batch.items()}
            outputs = self.model(**batch)  # forward through the DDP-wrapped model
            
            # backward pass + optimize
            loss = outputs.loss
            loss.backward()
            self.optimizer.step()   

    def _save_checkpoint(self, epoch):
        ckp = self.model.module.state_dict()
        torch.save(ckp, f"Checkpoint_{epoch}.pt")
        print(f"Epoch {epoch} | Training checkpoint save at checkpoint_{epoch}.pt")

    def train(self, epochs:int):
        for epoch in range(epochs):
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_every == 0:
                self._save_checkpoint(epoch)

from torch.optim import AdamW

def main(rank: int, world_size: int, epochs: int, save_every: int):
    # `model` and `dataset` are assumed to be defined at module level
    # (their definitions are not shown in this post).
    torch.cuda.empty_cache()
    ddp_setup(rank, world_size)
    adamw = AdamW(model.parameters(), lr=1e-3)
    train_data = prepare_dataloader(dataset, bs=16)
    trainer = Trainer(model, train_data, adamw, rank, save_every)
    trainer.train(epochs)
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    total_epochs = 300
    save_every = 10
    children = []
    # mp.spawn(main, args=(world_size, total_epochs, save_every,), nprocs=world_size)

    for i in range(world_size):
        subproc = mp.Process(target=main, args=(i, world_size, total_epochs, save_every))
        children.append(subproc)
        subproc.start()

    for i in range(world_size):
        children[i].join()