I am running torch.distributed.pipeline.sync.Pipe using PyTorch 1.8.1 on Python 3.8 (I also tried the nightly build). I have 2 visible CUDA devices. Below is the example from the documentation, followed by the same model wrapped with torchgpipe's GPipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.pipeline.sync import Pipe
from torchgpipe import GPipe
# Run with Pipe
fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
model = nn.Sequential(fc1, fc2)
model = Pipe(model, chunks=8)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
# Run with GPipe
fc1 = nn.Linear(16, 8)
fc2 = nn.Linear(8, 4)
model = nn.Sequential(fc1, fc2)
model = GPipe(model, balance=[1,1], chunks=8)
model = nn.DataParallel(model)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
print(output_rref)
The Pipe version fails with this error:
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    output_rref = model(input)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipe.py", line 366, in forward
    return RRef(output)
RuntimeError: agent INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/distributed/rpc/rpc_agent.cpp":247, please report a bug to PyTorch. Current RPC agent is not set!
However, the GPipe code works fine. What is causing the PyTorch assertion in the Pipe version?
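One likely cause, judging from the traceback: Pipe.forward wraps its result in an RRef, and creating an RRef needs a running RPC agent, which the script never starts. Below is a minimal sketch of starting one in a single process before building the Pipe. The helper names configure_rpc_env and init_single_process_rpc are hypothetical, the address and port are placeholder values, and I have not checked this against the nightly build.

```python
import os


def configure_rpc_env(master_addr="localhost", master_port="29500"):
    """Set the rendezvous variables read by the default env:// init method.

    The address and port are placeholder values for a single-machine run.
    """
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    return os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]


def init_single_process_rpc(name="worker"):
    """Start an RPC agent in this process (rank 0 in a world of size 1)."""
    # Imported lazily so configure_rpc_env works even without torch installed.
    from torch.distributed import rpc

    configure_rpc_env()
    rpc.init_rpc(name, rank=0, world_size=1)
    return rpc
```

With this, the idea would be to call init_single_process_rpc() before model = Pipe(model, chunks=8), read the result with model(input).local_value(), and call rpc.shutdown() at the end of the script.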