CUDA runs out of memory when trying to fit two GPT-2 Medium models on one GPU

I need to train two GPT-2 Medium models simultaneously, such that the output of one is used as input to the other, and so on recursively.
Unfortunately, I cannot fit both models on a single GPU together with the gradients computed during training.

I can think of two ways to solve this:

  1. Split the model itself into several smaller parts, place the parts on multiple GPUs, and train.
  2. Put the two GPT-2 models on two different GPUs and train them.
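
To make option 2 concrete, here is a minimal sketch of what I have in mind. `TinyModel` is a hypothetical stand-in for GPT-2 Medium, the device names fall back to CPU when fewer than two GPUs are present, and I am assuming the output passed between the models is a continuous tensor (not sampled token ids), so gradients can flow from the second model back into the first. Option 1 would presumably look similar, with different layers of a single module placed on different devices.

```python
import torch
import torch.nn as nn

# Pick devices; fall back to CPU so the sketch runs on any machine.
n_gpus = torch.cuda.device_count()
dev_a = torch.device("cuda:0" if n_gpus >= 1 else "cpu")
dev_b = torch.device("cuda:1" if n_gpus >= 2 else dev_a)


class TinyModel(nn.Module):
    """Hypothetical stand-in for GPT-2 Medium (assumption, not the real model)."""

    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
model_a = TinyModel().to(dev_a)  # first model lives on the first device
model_b = TinyModel().to(dev_b)  # second model lives on the second device

# One optimizer over both parameter sets; each parameter stays on its own device.
params = list(model_a.parameters()) + list(model_b.parameters())
opt = torch.optim.SGD(params, lr=0.1)

x = torch.randn(4, 16, device=dev_a)
out_a = model_a(x)
# Moving the activation with .to() is differentiable, so backward() crosses devices.
out_b = model_b(out_a.to(dev_b))
loss = out_b.pow(2).mean()  # placeholder loss for illustration only

opt.zero_grad()
loss.backward()
opt.step()
```

With this layout each GPU holds only one model's weights, activations, and gradients, but the two models run one after the other rather than in parallel.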

However, I do not know how feasible either option is, or how it would affect the performance of the model.
Is it possible to implement both options in PyTorch? If not, can you suggest other ways to do this?

Thank you in advance.