I have already found the cause.
I use multiprocessing and DistributedDataParallel in my code, which is adapted from https://github.com/pytorch/examples/blob/master/imagenet/main.py.
After saving the .pth file, I printed the weights of some BN layers and found that they are all exactly the same value.
That does not happen when I train on a single GPU.
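
For anyone who wants to run the same check, here is a rough sketch of it. It is not my exact code: the helper name `constant_params` and the sample numbers are made up for illustration, and it works on plain lists so it runs without a checkpoint.

```python
from typing import Dict, Iterable, List


def constant_params(params: Dict[str, Iterable[float]], tol: float = 0.0) -> List[str]:
    """Return the names of parameters whose elements are all the same value (within tol)."""
    flagged = []
    for name, values in params.items():
        vals = list(values)
        # A single-element parameter is trivially "constant", so skip it.
        if len(vals) > 1 and max(vals) - min(vals) <= tol:
            flagged.append(name)
    return flagged


# Hypothetical stand-in for BN weights read from a checkpoint (made-up numbers).
example = {
    "layer1.0.bn1.weight": [1.0, 1.0, 1.0, 1.0],      # every channel identical
    "layer1.0.bn2.weight": [0.93, 1.07, 0.88, 1.12],  # channels differ
}

# With a real checkpoint one could build the dict roughly like this
# (assumes the file stores the weights under a "state_dict" key and that
# BN layers have "bn" in their names; both are assumptions about the model):
#   ckpt = torch.load(path, map_location="cpu")
#   params = {k: v.flatten().tolist() for k, v in ckpt["state_dict"].items()
#             if "bn" in k and k.endswith("weight")}

print(constant_params(example))
```

Comparing the flagged names between the single-GPU checkpoint and the DistributedDataParallel checkpoint should show whether only the distributed run has this problem.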
Can someone help me?