A GPipe implementation in PyTorch

Kakao Brain announces torchgpipe, an implementation of GPipe in PyTorch, packaged as a handy library.

from torch import nn
from torchgpipe import GPipe

# a, b, c, d are nn.Module stages defined elsewhere; balance=[1, 1, 1, 1]
# puts one layer on each of four partitions, and chunks=8 splits every
# mini-batch into 8 micro-batches for pipelining.
model = nn.Sequential(a, b, c, d)
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)  # input: a mini-batch tensor

GPipe is a scalable pipeline parallelism library published by Google Brain. It enables the training of giant models that require a large amount of memory. For instance, Google trained AmoebaNet-B with 557M parameters using GPipe.


Hi, I’m trying to use torchgpipe on some other models, but the training time increased with GPipe, and I can’t reproduce the paper’s results with torchgpipe’s ResNet-101 example. I think I might be measuring the training time the wrong way. How did you measure the training time of ResNet-101 with GPipe? Thanks in advance!

I would expect the training time to take a hit, because you’re moving much more data around compared to a direct forward/backward pass. All of that overhead comes at a performance penalty. If I understand correctly, pipelining with this approach is best suited to training extremely large models when you have limited memory available.

Hi, thank you for your question.

There are some conditions for GPipe to optimize a model well:

  • The model requires a large amount of memory.
  • The original batch size is not too small, because each micro-batch must not be too small. If a micro-batch is too small, the GPUs are underutilized (see the sketch after this list).
  • The partitions are well balanced. Imbalance between partitions leaves GPipe underutilized.
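
To make the micro-batch point concrete, here is a minimal sketch of the arithmetic with hypothetical numbers. GPipe splits each mini-batch into chunks micro-batches, so the per-step work unit on each GPU is batch_size / chunks samples:

# Hypothetical numbers: how `chunks` trades pipeline depth against
# micro-batch size. GPipe splits each mini-batch into `chunks` pieces.
batch_size = 256
for chunks in (2, 8, 64):
    micro_batch = batch_size // chunks
    print(f"chunks={chunks:2d} -> micro-batches of {micro_batch} samples")
# chunks=64 gives micro-batches of 4 samples, probably too small to keep
# a GPU busy; chunks=2 gives 128 samples but barely any pipeline overlap.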

I just published my ResNet-101 performance benchmark. If you have the same environment as mine, I expect you’ll get the same result. Even if you don’t, the code should still be helpful. In particular, you can check the balance I used for ResNet-101, and how I measured the training time.
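
For reference, here is a minimal sketch of one common way to time GPU training. This is only an illustration, not necessarily what the benchmark script does, and it assumes model, data_loader, criterion, and optimizer are defined elsewhere:

import time
import torch

# model, data_loader, criterion, optimizer are assumed to be defined.
torch.cuda.synchronize()  # CUDA is asynchronous: flush pending kernels first
start = time.time()
for input, target in data_loader:
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()  # wait for all GPU work before reading the clock
elapsed = time.time() - start
print(f"epoch time: {elapsed:.1f}s")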

Yeah, I totally agree. I guess this approach only suits some models. It’s just that in the paper they achieved really great results on AmoebaNet, in terms of both memory usage and training time, which made me doubt my results.

Thanks a lot for your explanation! My GPU memory is too small for a large batch size, which limits my tests. I guess the main purpose of GPipe is to enable us to train extremely large models, not to accelerate the training process.

I noticed that in your experiments, the pipeline methods use different batch sizes. Isn’t this going to affect the model’s performance?

Good question. Yes, it does. My experiment reproduces a performance benchmark from the original paper. That benchmark also adjusts the batch size to maximize throughput, regardless of model accuracy.

In our experiments, we measured the effects of pipeline parallelism and recomputation on the model throughput of ResNet-101 and AmoebaNet-D (4, 512). We fixed the image size at 224 × 224. We adjusted the mini-batch size to maximize the throughput.

I see, thank you so much!


I just released v0.0.2 of torchgpipe with detailed documentation. This version includes automatic balancing.
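
For readers who haven’t seen the docs yet, automatic balancing looks roughly like the sketch below. It is based on the torchgpipe documentation for balance_by_time; the module path and exact signature have changed across versions, so treat this as a sketch rather than the definitive v0.0.2 API:

import torch
from torch import nn
from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time

# A small stand-in model; replace with your own nn.Sequential.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
)

# Profile the layers on a sample batch and propose a balance that
# splits the model across the available GPUs by measured run time.
partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(partitions, model, sample)

model = GPipe(model, balance, chunks=8)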