How about using DataParallel? You can split the training batches across multiple GPUs.
I would recommend first finding the smallest batch size you can use with only 1 GPU. For example, if the smallest batch you can run on a single GPU is N=8, then utilizing 4 GPUs allows you to increase the batch size to N=32.
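Here's a minimal sketch of what that looks like (the model and the sizes are just placeholders for whatever you're actually training):

```python
import torch
import torch.nn as nn

# Toy model standing in for your real one (hypothetical).
model = nn.Linear(128, 10)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch along dim 0 across the
    # visible GPUs, runs a replica on each, and gathers the outputs.
    model = nn.DataParallel(model)
model = model.cuda()

# If a single GPU handles N=8, then with 4 GPUs you can feed N=32:
# each replica still only sees a per-GPU chunk of 8.
inputs = torch.randn(32, 128).cuda()
outputs = model(inputs)  # outputs.shape == (32, 10)
```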
No, there is no point in doing that if you only have 1 GPU. In one of your options you were talking about multiple GPUs, so I assumed you had access to more than one.