Backprop Using Model Parallelism

I’ve distributed a very large transformer model across several GPUs and gotten it to generate predictions. Forward passes work fine, but it fails during backprop:

loss.backward()

c:\programdata\anaconda3\envs\context2\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    196                 products. Defaults to ``False``.
    197         """
--> 198         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    199 
    200     def register_hook(self, hook):

c:\programdata\anaconda3\envs\context2\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     98     Variable._execution_engine.run_backward(
     99         tensors, grad_tensors, retain_graph, create_graph,
--> 100         allow_unreachable=True)  # allow_unreachable flag
    101 
    102 

RuntimeError: expected device cuda:3 but got device cuda:0 (compute_types at ..\aten\src\ATen\native\TensorIterator.cpp:246)
(no backtrace available)

Moving all of the weights onto the last GPU would defeat the purpose of model parallelism, so what’s the right way to handle this? It definitely won’t be able to train in a data-parallel fashion, since the whole model doesn’t fit on a single GPU.

Could you post a minimal, executable code snippet so that we can reproduce this issue and debug it, please? 🙂

If anyone else runs into this and doesn’t have a specific solution in mind, the best thing to do seems to be to carefully trace which device every tensor and parameter actually lives on. In my case, I was using a transformer whose wte embedding weight is shared with the lm_head (weight tying), which meant the lm_head (counterintuitively) had to be placed on the first GPU, together with wte.
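
For anyone who finds this later, here is a rough sketch of the idea. The device IDs, layer names, and sizes below are made up for illustration and are not my actual model; the point is just that printing each parameter's device shows where things really ended up, and that the tied wte/lm_head weight forces the final hidden states back onto the first GPU before the head is applied:

import torch
import torch.nn as nn
import torch.nn.functional as F

dev_first, dev_mid = torch.device("cuda:0"), torch.device("cuda:1")

class TiedToyLM(nn.Module):
    # Toy stand-in for a GPT-2 style model: the output head shares its
    # weight with the input embedding (weight tying).
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.wte = nn.Embedding(vocab, hidden).to(dev_first)       # embedding on the first GPU
        self.block = nn.Linear(hidden, hidden).to(dev_mid)         # "middle" of the model elsewhere
        self.lm_head = nn.Linear(hidden, vocab, bias=False).to(dev_first)
        self.lm_head.weight = self.wte.weight                      # tied: one Parameter, lives on cuda:0

    def forward(self, ids):
        h = self.wte(ids.to(dev_first))     # cuda:0
        h = self.block(h.to(dev_mid))       # cuda:1
        h = h.to(dev_first)                 # move hidden states back to the GPU that
                                            # holds the tied wte/lm_head weight
        return self.lm_head(h)              # logits on cuda:0

model = TiedToyLM()

# "Follow every tensor": check where each parameter actually ended up.
print({name: p.device for name, p in model.named_parameters()})

ids = torch.randint(0, 100, (4, 8))
logits = model(ids)
loss = F.cross_entropy(logits.view(-1, 100), ids.view(-1).to(dev_first))
loss.backward()    # no device mismatch, since lm_head sits with wte on cuda:0

If instead the shared weight sits on one GPU while the module that uses it gets activations from another, you end up with exactly this kind of "expected device cuda:3 but got device cuda:0" mismatch, because a tied weight can only live on one device.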