Somewhat related to this answer.
I don’t know if you are planning to train the model or just to use it for inference.
For training, you could try to trade compute for memory via torch.utils.checkpoint,
or use CPU offloading as described in this post.
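Here is a minimal sketch of activation checkpointing via torch.utils.checkpoint (the module, layer sizes, and batch size are just placeholders for illustration; on recent PyTorch versions you can pass use_reentrant=False):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # hypothetical blocks; replace with your own submodules
        self.block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
        self.head = nn.Linear(1024, 10)

    def forward(self, x):
        # activations inside the checkpointed blocks are not stored;
        # they are recomputed during the backward pass instead
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

model = Net().cuda()
x = torch.randn(8, 1024, device='cuda')
out = model(x)
out.mean().backward()
```

The memory savings grow with the size of the checkpointed blocks, at the cost of one extra forward pass through them during backward.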
For inference, you should wrap the forward pass in a torch.no_grad()
context so that no computation graph (and thus no intermediate activations) is stored. If you are still running out of memory, you might need to use multiple GPUs and apply model sharding.
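A small sketch of the no_grad() usage (the model and input shapes are made up for the example):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
model.eval()  # switch dropout/batchnorm layers to eval behavior

x = torch.randn(8, 1024, device='cuda')
with torch.no_grad():  # no graph is built, so intermediate tensors are freed right away
    out = model(x)
```

And a minimal manual model-sharding sketch, assuming two visible GPUs (cuda:0 and cuda:1); the split point and layer sizes are again placeholders:

```python
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # each shard lives on its own device
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(4096, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        # move the activations to the second device before the next shard
        return self.part2(x.to('cuda:1'))

model = ShardedNet()
out = model(torch.randn(8, 1024))
```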