Can I parallelize a model across multiple GPUs?


I am currently stuck on a problem where I have multiple GPUs with 11 GB of memory each, but the model and optimizer I use are pretty huge and complicated. Even with the batch size set to 1, it still yields CUDA out of memory.

What can I do to figure this out?

Can I put different parts of the model on different GPUs? I notice that most of the memory consumption comes from the model parameters, gradients, and optimizer internal state.

After loading the model onto the GPU, it consumes 3 GB of memory. I use two optimizers for different parts of the model, and after the first loss.backward() and optimizer.step() with batch_size == 1, the single GPU is almost fully occupied.

Or, how can I restore the GPU memory to 3 GB after the first optimization step? I tried deleting the loss and zeroing the model's gradients, but it still does not work :(

Looking forward to your reply!

Yes, you could use model sharding: push different parts of the model (its parameters) to specific devices via to('cuda:id'), and use the same operation in the forward pass to move the activations to the right device.
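A minimal sketch of this idea (the model, layer sizes, and device choices are all made up for illustration; it falls back to CPU so it runs even without two GPUs):

```python
import torch
import torch.nn as nn

# On a real 2-GPU machine dev0/dev1 would be 'cuda:0' and 'cuda:1';
# the CPU fallback just lets the snippet run anywhere.
has_two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device('cuda:0' if has_two_gpus else 'cpu')
dev1 = torch.device('cuda:1' if has_two_gpus else 'cpu')

class ShardedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 512).to(dev0)  # first half on device 0
        self.part2 = nn.Linear(512, 10).to(dev1)    # second half on device 1

    def forward(self, x):
        x = self.part1(x.to(dev0))
        # move the activation between devices, just like the parameters
        x = self.part2(x.to(dev1))
        return x

model = ShardedModel()
out = model(torch.randn(1, 1024))
print(out.shape)  # torch.Size([1, 10])
```

Autograd handles the cross-device backward automatically, so the optimizers can be created as usual from each part's parameters.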

PyTorch Lightning provides a beta of sharded training, which might be interesting for you.
CC @williamFalcon for more information :slight_smile:

Thx! I will check it later.

And I noticed you once said in another topic that torch.utils.checkpoint may help when CUDA runs out of memory even with a batch size of 1. Does that fit my situation? I appreciate it so much!

Yes, torch.utils.checkpoint would trade compute for memory and could thus lower the memory usage.
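For example, a toy deep nn.Sequential (the layer sizes and segment count here are arbitrary) can be checkpointed with checkpoint_sequential, so the intermediate activations inside each segment are recomputed during backward() instead of being stored:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Made-up deep stack of 8 Linear+ReLU blocks
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)]
)

# The input must require grad, otherwise the checkpointed segments
# would detach from the graph and nothing would be saved.
x = torch.randn(4, 256, requires_grad=True)

# Split the stack into 2 segments; only the segment boundaries keep
# their activations, the rest are recomputed in backward().
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
print(out.shape, x.grad is not None)
```

For non-sequential models, torch.utils.checkpoint.checkpoint can wrap an arbitrary submodule or function the same way.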