I’m experimenting with a synchronous version of the gossip-based SGD described in this paper, applied to the MNIST dataset: https://arxiv.org/pdf/1611.04581.pdf
For this I’m using point-to-point communication with the TCP backend, which involves frequent transfers of parameters between the GPU and the CPU. I hadn’t encountered this issue with my implementation of all-reduce-based SGD using torch.distributed.all_reduce, with either the TCP or the Gloo backend, which is why I suspect the point-to-point communication (isend/irecv) is the cause.
Regarding environment, I’m running this with Python 3.6 / CUDA-8 / Ubuntu 16.04 on a GCE n1-standard-4 that has 4 cores / 15GB RAM / 32GB SSD / one K-80.
I’ll try reproducing this with more granular logging, and will provide a code sample once I can narrow it down.
Meanwhile, I’m curious if this has been encountered before and under what circumstances.
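For reference, the peers exchange tensors with plain torch.distributed point-to-point ops. Below is a minimal sketch of the kind of setup I mean; the old-style 'tcp' backend string and the ring-neighbour peer selection are illustrative assumptions, not my exact code.

import torch.distributed as dist

def init_gossip(rank, world_size, master_addr='127.0.0.1', master_port=29500):
    # Old-style TCP backend (pre-1.0 torch.distributed); Gloo would be
    # dist.init_process_group(backend='gloo', ...) instead.
    dist.init_process_group(
        backend='tcp',
        init_method='tcp://{}:{}'.format(master_addr, master_port),
        rank=rank,
        world_size=world_size,
    )
    # Each rank gossips with its ring neighbours; my actual peer selection
    # differs, this is just enough to make the snippet below runnable.
    peers = [(rank - 1) % world_size, (rank + 1) % world_size]
    return peers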
Here’s a snippet showing what happens with “gossip”-based parameter aggregation:
for param in model.parameters():
    # A container to hold async requests
    # ---- MARKER A ----
    requests = []
    # We've to exchange params with peers
    to_send = param.data
    if args.gpu:
        to_send = to_send.cpu()
    receive_buffers = [
        (peer, torch.zeros(param.data.shape))
        for peer in peers
    ]
    for peer, buffer in receive_buffers:
        requests.append(dist.isend(
            tensor=to_send,
            dst=peer
        ))
        requests.append(dist.irecv(
            tensor=buffer,
            src=peer
        ))
    # Wait for all the requests to complete
    for r in requests:
        r.wait()
    # ---- MARKER B ----
    # Then compute the average
    for _, buffer in receive_buffers:
        if args.gpu:
            buffer = buffer.cuda(device_id)
        param.data += buffer
    param.data /= len(receive_buffers) + 1
    # ---- MARKER C ----
This is called after every x updates to the parameters. Based on the logs, the error seems to occur consistently between ---- MARKER A ---- and ---- MARKER B ----. (Edit: I just observed it happen between ---- MARKER B ---- and ---- MARKER C ---- as well.)
If you’re curious about the model architecture (this is for the MNIST dataset):
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = nn.Linear(28 * 28, 1024)
        self.l2 = nn.Linear(1024, 1024)
        self.l3 = nn.Linear(1024, 1024)
        self.out = nn.Linear(1024, 10)
        for name, param in self.named_parameters():
            if 'weight' in name:
                nn.init.kaiming_normal(param)

    def forward(self, x):
        # input
        x = x.view(-1, 28 * 28)
        # reshaped
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.l1(x)
        x = F.relu(x)
        # l1
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.l2(x)
        x = F.relu(x)
        # l2
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.l3(x)
        x = F.relu(x)
        # l3
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.out(x)
        x = F.log_softmax(x, dim=1)
        # output
        return x
Please let me know if you’d like more information.
The more I look at the logs, the more this seems related to CPU <=> GPU communication; it appears mostly incidental that it crashes around the same places where the distributed communication happens.
I don’t have a full Python 3.5 install here, sorry.
I guess that to debug this I would remove features one by one until it stops crashing. Possibly try forcing some CUDA synchronization before doing the distributed communication?
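Something along these lines, reusing the names from your snippet (untested sketch; use_gpu, device_id and peers stand in for your args.gpu and related variables):

import torch
import torch.distributed as dist

def exchange_param(param, peers, use_gpu, device_id=0):
    to_send = param.data
    if use_gpu:
        to_send = to_send.cpu()
        # Force any pending GPU work (including the device-to-host copy)
        # to finish before handing the CPU tensor to isend/irecv.
        torch.cuda.synchronize()
    buffers = [torch.zeros_like(to_send) for _ in peers]
    requests = []
    for peer, buf in zip(peers, buffers):
        requests.append(dist.isend(tensor=to_send, dst=peer))
        requests.append(dist.irecv(tensor=buf, src=peer))
    for r in requests:
        r.wait()
    for buf in buffers:
        param.data += buf.cuda(device_id) if use_gpu else buf
    if use_gpu:
        # Synchronize again before the next round touches these tensors.
        torch.cuda.synchronize()
    param.data /= len(buffers) + 1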
Curious if you were ever able to figure this out. I’m getting your error plus an additional one when using NVIDIA’s APEX library for mixed precision: 'std::out_of_range' what(): vector::_M_range_check: __n (which is 671091489) >= this->size() (which is 3). I’m training a 3D network for segmentation in PyTorch and, due to GPU memory constraints, I place all activation computations on GPU1 and the loss computation on the CPU; GPU0 takes care of the rest.
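Roughly, the split looks like this (toy modules just to illustrate the placement, assuming two GPUs are visible; my real network is a full 3D segmentation model):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # "The rest" lives on GPU0, the activation-heavy part on GPU1.
        self.encoder = nn.Conv3d(1, 8, kernel_size=3, padding=1).to('cuda:0')
        self.head = nn.Conv3d(8, 2, kernel_size=1).to('cuda:1')

    def forward(self, x):
        x = self.encoder(x.to('cuda:0'))
        x = self.head(x.to('cuda:1'))
        # Loss is computed on the CPU to spare GPU memory.
        return x.cpu()

net = SplitNet()
logits = net(torch.randn(1, 1, 16, 64, 64))
target = torch.zeros(1, 16, 64, 64, dtype=torch.long)
loss = F.cross_entropy(logits, target)
loss.backward()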
@sid, yes, that is what I am doing as well… Thank you @ptrblck, the AMP implementation in PyTorch works really well. It almost seems faster… I’m not sure whether the error mentioned above will go away, but either way, PyTorch’s new AMP feature is nice to have.
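For anyone landing here later, the native AMP pattern I switched to looks roughly like this (minimal sketch with toy stand-ins; swap in your real model, data, criterion and optimizer):

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Tiny stand-ins so the snippet runs end to end.
model = nn.Linear(32, 2).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 32), torch.randint(0, 2, (8,))) for _ in range(4)]

scaler = GradScaler()
for images, targets in loader:
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        outputs = model(images.cuda(non_blocking=True))
        loss = criterion(outputs, targets.cuda(non_blocking=True))
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, skips the step on inf/nan
    scaler.update()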
terminate called after throwing an instance of 'std::system_error'
what(): Transport endpoint is not connected
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
  File "main_moco_cln.py", line 591, in <module>
    main()
  File "main_moco_cln.py", line 197, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/aniruddha/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/aniruddha/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGABRT
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
(base) aniruddha:moco$ /home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
  len(cache))
I’m getting this same error with different models in Detectron2 (DeepLab and now MaskFormer). Does anyone have a solution other than checkpointing and retrying? Maybe a different communication backend or something (rough sketch of what I mean after the traceback below)?
terminate called after throwing an instance of 'std::system_error'
what(): Transport endpoint is not connected
Traceback (most recent call last):
  File "/train_net.py", line 264, in <module>
    launch(
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
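To be concrete about the backend idea: outside of Detectron2's launcher the process group is created with a call like the one below, and swapping 'nccl' for 'gloo' is the kind of change I have in mind (address, port, rank and world_size here are placeholders; Detectron2's launch() sets up the process group itself, so there it would presumably mean patching the launcher).

import torch.distributed as dist

# Generic process-group init; changing the backend string is the whole "switch".
dist.init_process_group(
    backend='gloo',                        # or 'nccl'
    init_method='tcp://127.0.0.1:29500',   # placeholder address/port
    rank=0,
    world_size=2,
)

Gloo is generally slower than NCCL for GPU tensors, so this would be a debugging step rather than a fix.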