I’m experimenting with a synchronous version of the gossip-based SGD described in this paper, applied to the MNIST dataset: https://arxiv.org/pdf/1611.04581.pdf
For this I’m using point-to-point communication with the TCP backend, which involves frequent transfers of parameters between the GPU and the CPU. I hadn’t encountered this issue with my implementation of all-reduce-based SGD using torch.distributed.all_reduce, with either the TCP or the Gloo backend, which is why I suspect the point-to-point communication (isend/irecv) is the cause.
Regarding environment, I’m running this with Python 3.6 / CUDA-8 / Ubuntu 16.04 on a GCE n1-standard-4 that has 4 cores / 15GB RAM / 32GB SSD / one K-80.
I’ll try reproducing this with more granular logging, and will provide a code sample once I can narrow it down.
Meanwhile, I’m curious if this has been encountered before and under what circumstances.
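For reference, the peers exchange tensors with plain torch.distributed point-to-point ops. Below is a minimal sketch of the kind of setup I mean; the old-style 'tcp' backend string and the ring-neighbour peer selection are illustrative assumptions, not my exact code.

import torch.distributed as dist

def init_gossip(rank, world_size, master_addr='127.0.0.1', master_port=29500):
    # Old-style TCP backend (pre-1.0 torch.distributed); Gloo would be
    # dist.init_process_group(backend='gloo', ...) instead.
    dist.init_process_group(
        backend='tcp',
        init_method='tcp://{}:{}'.format(master_addr, master_port),
        rank=rank,
        world_size=world_size,
    )
    # Each rank gossips with its ring neighbours; my actual peer selection
    # differs, this is just enough to make the snippet below runnable.
    peers = [(rank - 1) % world_size, (rank + 1) % world_size]
    return peers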
Here’s a snippet showing what happens with “gossip”-based parameter aggregation:
for param in model.parameters():
    # A container to hold async requests
    # ---- MARKER A ----
    requests = []
    # We've to exchange params with peers
    to_send = param.data
    if args.gpu:
        to_send = to_send.cpu()
    receive_buffers = [
        (peer, torch.zeros(param.data.shape))
        for peer in peers
    ]
    for peer, buffer in receive_buffers:
        requests.append(dist.isend(
            tensor=to_send,
            dst=peer
        ))
        requests.append(dist.irecv(
            tensor=buffer,
            src=peer
        ))
    # Wait for all the requests to complete
    for r in requests:
        r.wait()
    # ---- MARKER B ----
    # Then compute the average
    for _, buffer in receive_buffers:
        if args.gpu:
            buffer = buffer.cuda(device_id)
        param.data += buffer
    param.data /= len(receive_buffers) + 1
    # ---- MARKER C ----
This is called after every x updates to the parameters. Based on the logs, the error seems to occur consistently between ---- MARKER A ---- and ---- MARKER B ----. (Edit: I just observed it happen between ---- MARKER B ---- and ---- MARKER C ---- as well.)
If you’re curious about the model architecture (this is for the MNIST dataset):
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.l1 = nn.Linear(28 * 28, 1024)
        self.l2 = nn.Linear(1024, 1024)
        self.l3 = nn.Linear(1024, 1024)
        self.out = nn.Linear(1024, 10)
        for name, param in self.named_parameters():
            if 'weight' in name:
                nn.init.kaiming_normal(param)

    def forward(self, x):
        # input
        x = x.view(-1, 28 * 28)
        # reshaped
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.l1(x)
        x = F.relu(x)
        # l1
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.l2(x)
        x = F.relu(x)
        # l2
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.l3(x)
        x = F.relu(x)
        # l3
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.out(x)
        x = F.log_softmax(x, dim=1)
        # output
        return x
Please let me know if you’d like more information.
The more I look at the logs, the more this seems related to CPU <=> GPU communication; it appears mostly incidental that it crashes around the same places where the distributed communication happens.
I don’t have a full Python 3.5 install here, sorry.
I guess that to debug this I would remove features one by one until it stops crashing. Possibly try forcing some CUDA synchronization before doing the distributed communication?
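Something along these lines, reusing the names from your snippet (untested sketch; use_gpu, device_id and peers stand in for your args.gpu and related variables):

import torch
import torch.distributed as dist

def exchange_param(param, peers, use_gpu, device_id=0):
    to_send = param.data
    if use_gpu:
        to_send = to_send.cpu()
        # Force any pending GPU work (including the device-to-host copy)
        # to finish before handing the CPU tensor to isend/irecv.
        torch.cuda.synchronize()
    buffers = [torch.zeros_like(to_send) for _ in peers]
    requests = []
    for peer, buf in zip(peers, buffers):
        requests.append(dist.isend(tensor=to_send, dst=peer))
        requests.append(dist.irecv(tensor=buf, src=peer))
    for r in requests:
        r.wait()
    for buf in buffers:
        param.data += buf.cuda(device_id) if use_gpu else buf
    if use_gpu:
        # Synchronize again before the next round touches these tensors.
        torch.cuda.synchronize()
    param.data /= len(buffers) + 1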
Curious if you were ever able to figure this out. I’m getting your error plus an additional one when using NVIDIA’s APEX library for mixed precision: 'std::out_of_range' what(): vector::_M_range_check: __n (which is 671091489) >= this->size() (which is 3). I’m training a 3D network for segmentation in PyTorch and, due to GPU memory constraints, I place all activation computations on GPU1 and the loss computation on the CPU; GPU0 takes care of the rest.
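Roughly, the split looks like this (toy modules just to illustrate the placement, assuming two GPUs are visible; my real network is a full 3D segmentation model):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # "The rest" lives on GPU0, the activation-heavy part on GPU1.
        self.encoder = nn.Conv3d(1, 8, kernel_size=3, padding=1).to('cuda:0')
        self.head = nn.Conv3d(8, 2, kernel_size=1).to('cuda:1')

    def forward(self, x):
        x = self.encoder(x.to('cuda:0'))
        x = self.head(x.to('cuda:1'))
        # Loss is computed on the CPU to spare GPU memory.
        return x.cpu()

net = SplitNet()
logits = net(torch.randn(1, 1, 16, 64, 64))
target = torch.zeros(1, 16, 64, 64, dtype=torch.long)
loss = F.cross_entropy(logits, target)
loss.backward()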
@sid, yes, that is what I am doing as well… Thank you @ptrblck, the AMP implementation in PyTorch works really well. It almost seems faster… I’m not sure whether the error mentioned above will go away, but either way, PyTorch’s new AMP feature is nice to have.
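For anyone landing here later, the native AMP pattern I switched to looks roughly like this (minimal sketch with toy stand-ins; swap in your real model, data, criterion and optimizer):

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Tiny stand-ins so the snippet runs end to end.
model = nn.Linear(32, 2).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 32), torch.randint(0, 2, (8,))) for _ in range(4)]

scaler = GradScaler()
for images, targets in loader:
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        outputs = model(images.cuda(non_blocking=True))
        loss = criterion(outputs, targets.cuda(non_blocking=True))
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, skips the step on inf/nan
    scaler.update()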
terminate called after throwing an instance of 'std::system_error'
what(): Transport endpoint is not connected
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
  File "main_moco_cln.py", line 591, in <module>
    main()
  File "main_moco_cln.py", line 197, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/aniruddha/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/aniruddha/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGABRT
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
/home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
len(cache))
(base) aniruddha:moco$ /home/aniruddha/anaconda3/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
  len(cache))
I’m getting this same error with different models in Detectron2 (DeepLab and now MaskFormer). Does anyone have a solution other than checkpointing and retrying? Maybe a different communication backend or something (rough sketch of what I mean after the traceback below)?
terminate called after throwing an instance of 'std::system_error'
what(): Transport endpoint is not connected
Traceback (most recent call last):
  File "/train_net.py", line 264, in <module>
    launch(
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
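To be concrete about the backend idea: outside of Detectron2's launcher the process group is created with a call like the one below, and swapping 'nccl' for 'gloo' is the kind of change I have in mind (address, port, rank and world_size here are placeholders; Detectron2's launch() sets up the process group itself, so there it would presumably mean patching the launcher).

import torch.distributed as dist

# Generic process-group init; changing the backend string is the whole "switch".
dist.init_process_group(
    backend='gloo',                        # or 'nccl'
    init_method='tcp://127.0.0.1:29500',   # placeholder address/port
    rank=0,
    world_size=2,
)

Gloo is generally slower than NCCL for GPU tensors, so this would be a debugging step rather than a fix.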