`terminate called after throwing an instance of 'std::system_error'`

sid · February 9, 2018, 2:16pm

I’m experimenting on the MNIST dataset with a synchronous version of the gossip-based SGD described in this paper: https://arxiv.org/pdf/1611.04581.pdf

For this, I’m using point-to-point communication with the TCP backend, which involves frequent transfers of parameters between GPU and CPU. I never hit this issue with my implementation of all-reduce based SGD, which uses torch.distributed.all_reduce with either TCP or Gloo, so I suspect the point-to-point communication (isend/irecv) is the cause.
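To make the pattern concrete, here is a minimal sketch of the kind of exchange I mean. It is not my actual code: `gossip_partner` and `exchange_and_average` are hypothetical names, the hypercube-style pairing (which needs a power-of-two world size) is just one way to get symmetric pairs, and the plain pairwise average is a simplification rather than the paper's update rule. It also assumes the process group has already been initialized by the caller.

```python
def gossip_partner(rank, world_size, step):
    """Pick a peer such that the pairing is symmetric (partner of partner == self).

    Assumption: world_size is a power of two, so flipping one bit of the
    rank always yields another valid rank.
    """
    if world_size < 2 or world_size & (world_size - 1) != 0:
        raise ValueError("world_size must be a power of two and at least 2")
    num_bits = world_size.bit_length() - 1
    return rank ^ (1 << (step % num_bits))


def exchange_and_average(param, partner):
    """Swap one parameter tensor with `partner` and average the two copies.

    Assumes torch.distributed.init_process_group() has already been called.
    """
    # Imported here so the pairing helper above can be used without torch.
    import torch
    import torch.distributed as dist

    send_buf = param.detach().cpu()        # GPU -> CPU (no-op if already on CPU)
    recv_buf = torch.zeros_like(send_buf)

    # Post both non-blocking operations before waiting on either one,
    # so the two peers cannot block each other.
    send_req = dist.isend(send_buf, dst=partner)
    recv_req = dist.irecv(recv_buf, src=partner)
    send_req.wait()
    recv_req.wait()

    averaged = (send_buf + recv_buf) / 2
    with torch.no_grad():
        param.copy_(averaged)              # CPU -> GPU if param lives on the GPU
```

In a training loop this would run once per parameter after the optimizer step, e.g. `exchange_and_average(p, gossip_partner(rank, world_size, step))` for each `p` in `model.parameters()`, which is where the frequent GPU/CPU copies come from.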

As for the environment, I’m running this with Python 3.6 / CUDA 8 / Ubuntu 16.04 on a GCE n1-standard-4 that has 4 cores / 15GB RAM / 32GB SSD / one K80.

I’ll try reproducing this with more granular logging, and will provide a code sample once I can narrow it down.

Meanwhile, I’m curious if this has been encountered before and under what circumstances.