I’ve distributed a very large transformer model across several GPUs and gotten it to generate predictions. Forward passes work fine, but backpropagation fails with a device mismatch error:
loss.backward()
c:\programdata\anaconda3\envs\context2\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
196 products. Defaults to ``False``.
197 """
--> 198 torch.autograd.backward(self, gradient, retain_graph, create_graph)
199
200 def register_hook(self, hook):
c:\programdata\anaconda3\envs\context2\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
--> 100 allow_unreachable=True) # allow_unreachable flag
101
102
RuntimeError: expected device cuda:3 but got device cuda:0 (compute_types at ..\aten\src\ATen\native\TensorIterator.cpp:246)
(no backtrace available)
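For context, here’s a stripped-down sketch of the kind of split I have. It’s a toy two-block model on two of the GPUs (cuda:0 and cuda:3) instead of the real transformer, so the layer names and sizes are made up, but the structure is the same: each block lives on its own device and I move the activations between them with .to().

import torch
import torch.nn as nn

# Toy stand-in for the real transformer, which is spread over cuda:0..cuda:3.
# Each block sits on its own device; activations are handed off with .to().
class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(512, 512).to("cuda:0")
        self.block2 = nn.Linear(512, 10).to("cuda:3")

    def forward(self, x):
        x = self.block1(x.to("cuda:0"))
        x = self.block2(x.to("cuda:3"))   # hand off to the next device
        return x

model = SplitModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 512, device="cuda:0")
labels = torch.randint(0, 10, (8,), device="cuda:0")

outputs = model(inputs)                         # comes back on cuda:3
loss = criterion(outputs, labels.to("cuda:3"))  # targets moved to the output's device
loss.backward()                                 # in the full model, this is where the RuntimeError above appears
optimizer.step()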
Moving all the weights onto the last GPU would defeat the purpose of model parallelism, and a data-parallel setup definitely won’t work here, since the whole model would have to fit on each GPU. What’s the right way to make the backward pass work across devices?