Pipeline Parallelism (Pipe) Assertion Failure

I am running torch.distributed.pipeline.sync.Pipe with PyTorch 1.8.1 on Python 3.8 (I also tried the nightly build). I have 2 visible devices. Below is the example from the docs.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.pipeline.sync import Pipe
from torchgpipe import GPipe

# Run with Pipe
fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
model = nn.Sequential(fc1, fc2)
model = Pipe(model, chunks=8)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)

# Run with GPipe
fc1 = nn.Linear(16, 8)
fc2 = nn.Linear(8, 4)
model = nn.Sequential(fc1, fc2)
model = GPipe(model, balance=[1,1], chunks=8)
model = nn.DataParallel(model)
input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
print(output_rref)

I am getting this error:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    output_rref = model(input)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr0/home/ruohongz/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/distributed/pipeline/sync/pipe.py", line 366, in forward
    return RRef(output)
RuntimeError: agent INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/distributed/rpc/rpc_agent.cpp":247, please report a bug to PyTorch. Current RPC agent is not set!

However, the GPipe code works fine. What is causing the PyTorch assertion failure?

You need to initialize the RPC framework before constructing Pipe. Pipe's forward returns an RRef, which depends on torch.distributed.rpc; since no RPC agent has been set up in your script, the internal assertion fires. See the latest master docs: Pipeline Parallelism — PyTorch master documentation
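
For reference, here is a minimal sketch of the pattern the docs show, assuming a single-process, single-machine setup (the worker name "worker" and port 29500 follow the documentation's example and can be changed):

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe's forward returns an RRef, so an RPC agent must be initialized
# first -- even when everything runs in a single process.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

fc1 = nn.Linear(16, 8).cuda(0)
fc2 = nn.Linear(8, 4).cuda(1)
model = nn.Sequential(fc1, fc2)
model = Pipe(model, chunks=8)

input = torch.rand(16, 16).cuda(0)
output_rref = model(input)
output = output_rref.to_here()  # fetch the local tensor from the RRef
print(output.shape)

rpc.shutdown()

Calling rpc.init_rpc before constructing Pipe sets the current RPC agent that RRef(output) needs, which is why the assertion no longer fires. GPipe does not go through the RPC framework at all, which is why your second snippet runs without it.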