I am running the torch.distributed.pipeline.sync.Pipe library with PyTorch 1.8.1 on Python 3.8 (also tried the nightly build). I have 2 visible devices. Below is the example from the docs.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.pipeline.sync import Pipe
from torchgpipe import GPipe

# Run with Pipe
fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
model = nn.Sequential(fc1, fc2)
model = Pipe(model, chunks=8)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)

# Run with GPipe
fc1 = nn.Linear(16, 8)
fc2 = nn.Linear(8, 4)
model = nn.Sequential(fc1, fc2)
model = GPipe(model, balance=[1, 1], chunks=8)
model = nn.DataParallel(model)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
print(output_rref)
```
I am getting this error:
```
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    output_rref = model(input)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipe.py", line 366, in forward
    return RRef(output)
RuntimeError: agent INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/distributed/rpc/rpc_agent.cpp":247, please report a bug to PyTorch. Current RPC agent is not set!
```
However, the GPipe code works fine. What is causing the PyTorch assertion failure?
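For context, the error message ("Current RPC agent is not set!") and the failing line (`return RRef(output)`) suggest that `Pipe` wraps its output in an RRef and therefore needs the RPC framework initialized before the model is constructed, even for a single-process run. The PyTorch docs for `Pipe` initialize RPC via `torch.distributed.rpc.init_rpc`. A minimal sketch of the setup I believe is missing (the worker name, address, and port here are arbitrary single-process assumptions, not values from the original script):

```python
import os
import torch.distributed.rpc as rpc

# Pipe returns an RRef from forward(), which requires an active RPC
# agent. Initialize a single-process RPC group before building Pipe.
os.environ['MASTER_ADDR'] = 'localhost'   # assumed: local single-node run
os.environ['MASTER_PORT'] = '29500'       # assumed: any free port
rpc.init_rpc('worker', rank=0, world_size=1)

# ... then construct Pipe(model, chunks=8) and call it as in the
# example above; the RRef can be unwrapped with output_rref.local_value()
```

If this is correct, the GPipe path works because torchgpipe returns a plain tensor and never touches the RPC machinery.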