For example, suppose I have a big module “BigNet” and two GPUs, and each GPU’s memory only has room to train one module at a time. I know I can do this:
net1 = BigNet().to(gpu1)  # trained by optim1
net2 = BigNet().to(gpu2)  # trained by optim2
X = X.to(gpu1)
y = net1(X)
y = net2(y.to(gpu2))
loss = loss_fn(y, label.to(gpu2))  # the label must live on gpu2 as well
loss.backward()
optim1.step()
optim2.step()
This way I can train a bigger model split into two modules. But gpu1 sits idle while gpu2 is computing. Can I offload net1 from gpu1 to CPU memory at that point, train a third module net3 on gpu1, and copy net1 back to the GPU once the backward pass reaches it? For example:
net1 = BigNet().to(gpu1)  # trained by optim1
net2 = BigNet().to(gpu2)  # trained by optim2
net3 = BigNet()           # trained by optim3, starts on CPU
X = X.to(gpu1)
y = net1(X)
y = net2(y.to(gpu2))
net1.to(cpu)   # free gpu1 while net2 is busy
net3.to(gpu1)  # bring the third stage onto gpu1
y = net3(y.to(gpu1))
loss = loss_fn(y, label.to(gpu1))
loss.backward(Control)  # pseudocode: back-propagate only through net3 and net2
optim3.step()
optim2.step()
net3.to(cpu)   # swap stages again for the rest of the backward pass
net1.to(gpu1)
loss.backward(Control)  # pseudocode: continue the backward through net1
optim1.step()
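There is no `Control` argument to `backward()` in PyTorch, but the two-stage backward above can be sketched by cutting the autograd graph at the stage boundary with `detach()` and passing the saved gradient back in by hand. This is a minimal CPU-only sketch under that assumption (`BigNet` replaced by a small `nn.Linear`, device moves shown only as comments); it is an illustration of the technique, not a definitive implementation:

```python
import torch
import torch.nn as nn

# Stand-ins for the three BigNet stages (illustrative, CPU-only).
net1, net2, net3 = (nn.Linear(4, 4) for _ in range(3))
opt1 = torch.optim.SGD(net1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(net2.parameters(), lr=0.1)
opt3 = torch.optim.SGD(net3.parameters(), lr=0.1)

X = torch.randn(2, 4)
label = torch.randn(2, 4)

# Forward through stage 1, then cut the graph at the boundary so the
# first backward call stops here instead of reaching into net1.
h1 = net1(X)
h1_boundary = h1.detach().requires_grad_(True)

# (in the multi-GPU version, net1.to(cpu) and net3.to(gpu1) would go here)
out = net3(net2(h1_boundary))
loss = nn.functional.mse_loss(out, label)

# First "controlled" backward: only net2/net3 and the cut point get grads.
loss.backward()
opt3.step()
opt2.step()

# (in the multi-GPU version, net3.to(cpu) and net1.to(gpu1) would go here)
# Continue the backward through net1 using the gradient saved at the cut.
h1.backward(h1_boundary.grad)
opt1.step()
```

The same cut-and-resume pattern is what pipeline-parallel schedulers do internally; the `.to()` swaps are legal between the two backward calls because `detach()` keeps net1's subgraph alive independently of the rest.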
Thanks