Hey @shaoming20798, are G_AB and G_BA the two large models you referred to? Would it work to put G_AB and G_BA on two different GPUs, move the computed losses onto the same GPU, then compute loss_G on that GPU and run backward from there?
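Something like this sketch, assuming G_AB and G_BA are stand-ins for your generators (here tiny `nn.Linear` layers for illustration) and that you have two GPUs; it falls back to CPU so it can run anywhere. Note that `Tensor.to(device)` is tracked by autograd, so gradients flow back across the device copy:

```python
import torch
import torch.nn as nn

# Assumption: two GPUs are available; otherwise everything lands on CPU
# so the sketch still runs for illustration.
two_gpus = torch.cuda.device_count() >= 2
dev_a = torch.device("cuda:0" if two_gpus else "cpu")
dev_b = torch.device("cuda:1" if two_gpus else "cpu")

# Hypothetical stand-ins for your two large models.
G_AB = nn.Linear(8, 8).to(dev_a)  # first model on its own device
G_BA = nn.Linear(8, 8).to(dev_b)  # second model on the other device

x_a = torch.randn(4, 8, device=dev_a)
x_b = torch.randn(4, 8, device=dev_b)

loss_ab = G_AB(x_a).pow(2).mean()            # lives on dev_a
loss_ba = G_BA(x_b).pow(2).mean().to(dev_a)  # moved to dev_a; copy is differentiable

loss_G = loss_ab + loss_ba  # combined loss on one device
loss_G.backward()           # gradients reach parameters on both devices
```

After `backward()`, both models should have populated `.grad` fields on their respective devices, so each optimizer step can stay local to its GPU.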
BTW, for distributed training discussions, please consider adding a “distributed” tag. People working on distributed training actively check that category.