I use it as follows:
def add_outlinks(arr, source):
    for dest in arr:
        if int(dest) in _local_dict:
            _local_dict[dest].in_links.append(int(source))

rpc.init_rpc(
    my_name,
    rank=rank,
    world_size=size,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=10000),
    ),
)  # initial_rpc

# Call RPC to other ranks
if rank == 0:
    print("add-link...")
try:
    array_rpc = list(range(0, size))
    count = 0
    for it in _local_dict:
        count = count + 1
        arr_send = []
        for i in range(0, size):
            arr_send.append([])
        u = _local_dict[it]
        source = u.vertexId
        for i in u.links:
            arr_send[int(i) % size].append(int(i))
        for i in array_rpc:
            my_target = "worker" + str(i)
            if len(arr_send[i]) > 0:
                rpc.rpc_async(my_target, add_outlinks, args=(arr_send[i], source))
except:
    print("rank ", rank, " run ", count, "/", len(_local_dict))
rpc.api._wait_all_workers()
print("shutdown.... rpc... ", rank)
rpc.api._wait_all_workers()
rpc.shutdown()
- arr_send[i] is the list that will be sent to rank i.
The elements of _local_dict are independent of each other, so they can be processed in parallel.
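To make the bucketing step concrete, here is a small standalone sketch with no RPC involved. It assumes, as in my code above, that a vertex id i belongs to rank i % size. The helper name bucket_by_rank and the sample link ids are made up for illustration only.

```python
def bucket_by_rank(links, size):
    """Group destination vertex ids by the rank that owns them (id % size)."""
    # One empty bucket per rank, like arr_send in the code above.
    buckets = [[] for _ in range(size)]
    for link in links:
        buckets[int(link) % size].append(int(link))
    return buckets


if __name__ == "__main__":
    # Hypothetical out-links of a single vertex, split across 4 workers.
    print(bucket_by_rank([0, 1, 5, 6, 8, 11], size=4))
    # prints [[0, 8], [1, 5], [6], [11]]
```

Only the non-empty buckets would then be sent, one rpc_async call per destination rank.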
1. After 10000 s, how many objects have been processed?
–> The output is shown below. worker0 does not appear in it. I tried to print the value of count, but nothing was printed for it.
....
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:45462
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
....
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:19635
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:27553
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:44970
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
Traceback (most recent call last):
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "pagerank.py", line 380, in init_process
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
...
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
_wait_all_workers()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:5848: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:50095: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:29331
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
args=(sequence_id, self_worker_name,),
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:15236
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2720
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:23214
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:38547
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:50607
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
...
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:12173
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
return fut.wait()
RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:57022: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:2693
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
During handling of the above exception, another exception occurred:
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9877
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:3715
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [2001:700:4a01:10::38]:22942: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:36780: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:47501: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:44931: Connection reset by peer
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:3213: Connection reset by peer
Traceback (most recent call last):
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:13723
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "pagerank.py", line 380, in init_process
print("shutdown.... rpc... ", rank)
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:9140
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 240, in shutdown
_wait_all_workers()
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 165, in _wait_all_workers
args=(sequence_id, self_worker_name,),
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 77, in wrapper
return func(*args, **kwargs)
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 554, in rpc_sync
return fut.wait()
RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [2001:700:4a01:10::38]:9889: Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
File "/cluster/home/cnphuong/.conda/envs/Pytorch_ENV/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
KeyboardInterrupt
[E thread_pool.cpp:112] Exception in thread pool task: Application timeout caused pair closure
[E thread_pool.cpp:112] Exception in thread pool task: [/opt/conda/conda-bld/pytorch_1587428228634/work/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [2001:700:4a01:10::38]:34716
I changed the code to print count as follows:
for it in _local_dict:
    if rank == 0:
        count = count + 1
        print(count)
–> All elements in _local_dict are processed. However, the program is still stopped by the timeout.
2. Is distributed autograd/optimizer also used in this case?
–> Not yet. Is that what is described at https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html ?
3. Are any of those CPUs located on the same machine? (so that shm can be helpful)
–> Yes. I use all 32 CPUs of a single machine.
Thanks