Long (really long) training time when fine-tuning BERT

Hi, I am training on a dataset of 700,000 samples. Each sample is just a piece of text with a binary label. What I am doing is:

import transformers

model = transformers.BertModel.from_pretrained("uncased_bert")
outputs = model(input_ids=ids, attention_mask=mask, token_type_ids=token_type_ids)
loss = loss_fun(outputs, targets)

For each piece of text, I use encode_plus(text) to get the ids, mask, etc.
I am using a batch size of 32 and a learning rate of 3e-5 (smaller seems to be better according to some research), and I find that training is taking a really long time: a few days, if I calculated correctly. I also noticed that with 2 GPUs each of them uses about 6 GB of memory, while with 1 GPU it uses about 11 GB. Is there anything I can do to speed this up? Thanks!
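
For reference, the "few days" figure comes from simple arithmetic, which can be sketched roughly as below. The sample count and batch size are the ones from this post; the helper names and the `seconds_per_step` value are hypothetical placeholders, not measurements, so plug in a timing from your own run.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def estimated_hours(num_samples: int, batch_size: int,
                    seconds_per_step: float, epochs: int = 1) -> float:
    """Rough wall-clock estimate, assuming every step takes the same time."""
    total_steps = steps_per_epoch(num_samples, batch_size) * epochs
    return total_steps * seconds_per_step / 3600


# Values from the post: 700,000 samples, batch size 32.
steps = steps_per_epoch(700_000, 32)
print(steps)  # 21875

# ASSUMPTION: 1.0 s per step is a placeholder, not a measured value.
hours = estimated_hours(700_000, 32, seconds_per_step=1.0, epochs=3)
print(f"{hours:.1f} hours for 3 epochs")
```

Since the estimate scales linearly with the time per step and with the number of steps per epoch, anything that shortens a step or lets a larger batch fit in memory should shrink the total proportionally.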