How to fine-tune an XLM-RoBERTa-large MLM on a Google Cloud A2 machine quickly and with less memory

I am fine-tuning a masked language model based on XLM-RoBERTa-large on Google Cloud machines. I ran a couple of experiments and found some of the results surprising.

"a2-highgpu-4g" ,accelerator_count=4, accelerator_type="NVIDIA_TESLA_A100" on 4,12,672 data batch size 4 Running ( 4 data*4 GPU=16 data points)
"a2-highgpu-4g" ,accelerator_count=4 , accelerator_type="NVIDIA_TESLA_A100"on 4,12,672 data batch size 8 failed
 "a2-highgpu-4g" ,accelerator_count=4, accelerator_type="NVIDIA_TESLA_A100" on 4,12,672 data batch size 16 failed
"a2-highgpu-4g" ,accelerator_count=4.,accelerator_type="NVIDIA_TESLA_A100" on 4,12,672 data batch size 32 failed

I was not able to train with a per-device batch size larger than 4 on these GPUs; training stopped midway.

Here is the code I am using.

import transformers as tr

training_args = tr.TrainingArguments(
    output_dir='/home/pc/Bert_multilingual_exp_TCM/results_mlm_exp2',
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    prediction_loss_only=True,
    save_strategy="no",
    run_name="MLM_Exp1",
    learning_rate=2e-5,
    logging_dir='/home/pc/Bert_multilingual_exp_TCM/logs_mlm_exp1',  # directory for storing logs
    logging_steps=40000,
    logging_strategy='no',
)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
)

trainer.train()
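
Since a per-device batch size of 4 runs but 8 and above stop partway through, I suspect one of the GPUs is running out of memory. To confirm this I was planning to print per-GPU memory after a few steps with something like the sketch below (not verified):

import torch

# Sketch: print currently allocated and peak memory for each visible GPU,
# to see whether one device (e.g. GPU 0 under DataParallel) fills up first.
for i in range(torch.cuda.device_count()):
    print(
        f"GPU {i} ({torch.cuda.get_device_name(i)}): "
        f"{torch.cuda.memory_allocated(i) / 2**30:.1f} GiB allocated, "
        f"{torch.cuda.max_memory_allocated(i) / 2**30:.1f} GiB peak"
    )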

My Questions

How can I train with a larger batch size on the a2-highgpu-4g machine?
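
For the first question, the gradient_accumulation_steps argument in TrainingArguments looks like one way to get a larger effective batch size without using more memory per step. This is only a sketch of what I was considering (the accumulation value of 4 is arbitrary), keeping the per-device batch of 4 that already fits:

import transformers as tr

# Sketch: keep per_device_train_batch_size=4 (which fits) and accumulate
# gradients over 4 steps -> effective batch = 4 per device x 4 GPUs x 4 steps = 64.
training_args = tr.TrainingArguments(
    output_dir='/home/pc/Bert_multilingual_exp_TCM/results_mlm_exp2',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # illustrative value only
)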

Which parameters can I include in TrainingArguments so that training is faster and uses less memory?
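
For the second question, the two TrainingArguments flags I found in the docs are fp16=True (mixed precision, which the A100s support) and gradient_checkpointing=True (recompute activations in the backward pass to save memory at some compute cost). A sketch of how I understood it, assuming both flags combine cleanly with my setup:

import transformers as tr

# Sketch: memory/speed-oriented flags on top of the arguments I already use.
training_args = tr.TrainingArguments(
    output_dir='/home/pc/Bert_multilingual_exp_TCM/results_mlm_exp2',
    per_device_train_batch_size=4,
    fp16=True,                    # mixed-precision training
    gradient_checkpointing=True,  # trade compute for activation memory
)

I also understand that launching the script with torchrun --nproc_per_node=4 (DistributedDataParallel) instead of relying on the Trainer's default DataParallel may balance memory across the four GPUs better, but I have not tried that yet.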

Thanks in advance.

Versions

torch==1.11.0+cu113
torchvision==0.12.0+cu113
torchaudio==0.11.0+cu113
transformers==4.17.0