OOM problem when using distributed training

I encountered this problem in my recent work. Everything is fine when I train my model in distributed mode from scratch: all GPUs show the same memory consumption. However, if I load a pre-trained model, the master GPU consumes much more memory than the others, which then causes an OOM error.
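
For context, here is a minimal sketch of the kind of setup I mean, assuming PyTorch with DistributedDataParallel (the model and checkpoint path are placeholders, not my actual code). The checkpoint is loaded with a plain `torch.load` before the model is wrapped in DDP:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, launched e.g. with torchrun, which sets LOCAL_RANK
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; my real model is larger
    model = torch.nn.Linear(512, 512).cuda(local_rank)

    # Load the pre-trained weights; no map_location is passed, so every
    # process deserializes the tensors onto whichever device the
    # checkpoint was saved from (typically cuda:0)
    state_dict = torch.load("pretrained.pth")  # placeholder checkpoint path
    model.load_state_dict(state_dict)

    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...

if __name__ == "__main__":
    main()
```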

How can I solve this problem?