Save model for DistributedDataParallel

Hi, I am new to DistributedDataParallel, but I find that almost all of the example code in PyTorch saves only the rank 0 model. I just want to know: do we need to save the model of every rank if we do not sync the BN parameters in our model? If the BN parameters are not synced, each rank seems to have a different model, but we often use all the ranks for inference. So, if we want to get the same inference result as when we use all the GPUs for inference, should we save the model of each rank and then load all of them for inference? Hope for help, thanks very much!

Yes, that’s correct. Either you only run inference on the model on rank 0, or you explicitly replicate the BN stats from rank 0 to the other ranks prior to inference. The model weights should be identical across processes and only the BN stats should be different.

I found that using only the rank 0 model trained with DistributedDataParallel for inference on the validation set performs worse than having each rank use its own trained model for inference: using only the rank 0 model is usually 0.5% to 1% lower in accuracy on a classification task. So, do we need to all-reduce the model during training, for example before or after every epoch? Hope for your detailed reply. Thanks very much.

How do you do validation with a model wrapped with DistributedDataParallel? I imagine you call .eval() everywhere and have every worker validate a subset of your validation set, followed by allreduce of the accuracies. Is this correct?
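For concreteness, here is a minimal sketch of the pattern I mean, aggregating per-rank correct/total counts with dist.all_reduce (the model, loader, and device names are placeholders):

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def distributed_accuracy(model, val_loader, device):
    # Each rank evaluates only its own shard of the validation set
    # (e.g. via a DistributedSampler on val_loader).
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    for images, targets in val_loader:
        images, targets = images.to(device), targets.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == targets).sum()
        total += targets.numel()
    # Sum the per-rank counts so every process ends up with the global accuracy.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return (correct / total).item()
```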

Another piece of information that might be useful here: if you construct DistributedDataParallel with broadcast_buffers=True then all ranks will receive a copy of the registered buffers prior to executing a forward pass. With respect to batch normalization this means they all use the running stats from rank 0.
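A minimal sketch of that setup, assuming the process group has already been initialized and with MyModel standing in for your network:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(backend="nccl") has already been called
# in every process; MyModel is a placeholder for your network.
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
model = MyModel().cuda(local_rank)

# broadcast_buffers=True (the default) broadcasts all registered buffers,
# including the BN running_mean / running_var, from rank 0 to every other
# rank before each forward pass.
ddp_model = DDP(model, device_ids=[local_rank], broadcast_buffers=True)
```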


Yes, I use every rank to validate a subset of my validation set; that is the default when we define the data loader for distributed training. Anyhow, that is not the important point.
Since the default setting of broadcast_buffers is True, the batch normalization statistics used are the ones from rank 0, and every other rank shares rank 0's statistics.
But in fact the accuracy is different: (1) when I only save the model on rank 0 and all the other ranks load rank 0's model for inference, accuracy is 75%; (2) when each rank saves its own model and loads its own model for inference, accuracy is 75.8%,
even though model.eval() freezes the BN statistics.
I think the main differences might be: (1) each rank's model is not the same; they are different models, with different BN parameters at least.
(2) Furthermore, in the backward pass, different BN parameters might cause the gradients to be computed differently in every layer. I am not sure whether the gradient is computed on each rank, or whether only the gradient computed on rank 0 is broadcast.
So, what is the truth? Thank you very much.

The only difference I can imagine is that the BN parameters don’t get replicated from rank 0 to the other ranks after the final call to backward, since they are frozen for evaluation after that. I highly doubt that can be the cause of a 0.8% accuracy difference though. What is your approach for aggregating the accuracy numbers?

Regarding (2): the gradient is computed in every process and then averaged across them. This means that if you start with identically initialized model weights and use the same optimizer in all processes, the optimizer will take the same step everywhere.
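If you want to convince yourself, here is a rough sketch that compares each parameter's gradient against rank 0's right after a backward pass (ddp_model stands for your wrapped model):

```python
import torch
import torch.distributed as dist

def check_grad_sync(ddp_model):
    # Call this right after loss.backward(); DDP has already averaged the
    # gradients, so every rank's .grad should match rank 0's.
    for name, param in ddp_model.module.named_parameters():
        if param.grad is None:
            continue
        ref = param.grad.detach().clone()
        dist.broadcast(ref, src=0)  # overwrite ref with rank 0's gradient
        if not torch.allclose(param.grad, ref):
            print(f"rank {dist.get_rank()}: gradient mismatch in {name}")
```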

Thanks for your kind reply.

(1) Yes, I am sure there is a 0.8% accuracy difference. If you run the PyTorch example code for ImageNet, you will see the same thing: the difference definitely exists. Sometimes the difference is as small as 0.1%, and sometimes you may get 0.3% or an even larger gap between the two methods. Different seeds and num_workers settings might lead to a different gap; I am not sure.

(2) By the way, I have not been able to make my experiments with DistributedDataParallel reproducible. I have tried the settings cudnn.benchmark=False and cudnn.deterministic=True, set the torch seed and CUDA seed for each rank, set worker_init_fn (numpy seed and random.seed) for the dataloader, and fixed num_workers to the same number every time I run my code, but the results are always different. How can we reproduce our own experiments with the same settings?

(3) As for the accuracy: each rank saves the indices of the misclassified files and writes them to a JSON file, and at the end I merge all the JSON files to compute the accuracy, which is an exact method. I have also followed the dist.all_reduce method and got a similar result. In both cases the difference clearly exists and is sometimes unexpectedly large.
So I really do think syncing the BN parameters is quite important for some tasks, when we cannot share a big batch size.

(4) As for the difference between the models on each rank: should we all-reduce the whole model on every rank right after each epoch, so that the BN parameters get averaged by the all-reduce (roughly as in the sketch below)? This may not cost too much time when we run on multiple nodes.
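Something like this rough sketch is what I have in mind (just an idea, not something DDP does for you): average the registered buffers, which hold the BN running stats, across ranks once per epoch.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def average_bn_buffers(ddp_model):
    # Average every floating-point buffer (BN running_mean / running_var)
    # across ranks; num_batches_tracked is an integer buffer and is skipped.
    world_size = dist.get_world_size()
    for buf in ddp_model.module.buffers():
        if buf.dtype.is_floating_point:
            dist.all_reduce(buf, op=dist.ReduceOp.SUM)
            buf.div_(world_size)

# e.g. call average_bn_buffers(ddp_model) at the end of every training epoch
```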

Regarding the random seed, do you also call torch.manual_seed()? There are a few that you have to initialize unfortunately.
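For reference, the seeding calls I mean are roughly the following (a sketch; adapt the seed handling to your setup):

```python
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds the CPU and current CUDA device
    torch.cuda.manual_seed_all(seed)   # seeds all CUDA devices
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

def worker_init_fn(worker_id):
    # Derive a per-worker seed for the DataLoader workers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```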

Regarding the BN stats sync. If you use DDP with broadcast_buffers=True then it will replicate the BN statistics before every forward pass and make sure all ranks have a copy of the stats used by rank 0. The models must be identical after this step happens. You could try to confirm this by passing an identical input to the model and verify they produce an identical output.
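A rough way to run that check on every rank (the input shape here is an arbitrary assumption):

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def check_identical_outputs(ddp_model, device):
    # Same seed on every rank, so the test input is identical everywhere.
    torch.manual_seed(0)
    x = torch.randn(2, 3, 224, 224, device=device)
    out = ddp_model(x)
    ref = out.clone()
    dist.broadcast(ref, src=0)  # rank 0's output
    match = torch.allclose(out, ref, atol=1e-6)
    print(f"rank {dist.get_rank()}: output matches rank 0 -> {match}")
```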


So, how can I make sure each rank has the same model when I use DDP?
To be specific, do I need to set the same torch seed on each rank, so that the initialized model is the same on every rank at the beginning?
It seems that each rank displays different output.

Another problem is: how can I make the experiments reproducible? Thanks very much.

The initialized model is broadcast from rank 0 to the other processes from the DDP constructor, so the initialized parameters will be identical when you first call forward.

Numerical reproducibility depends on determinism of the inputs/transforms, deterministic partitioning of the dataset, as well as deterministic execution of your model (e.g. there are modes in which cuDNN does not execute deterministically). If there is non-determinism at any of these points, numerical reproducibility is impossible.

Hey all,
Why don’t you take a look at SyncBatchNorm?

From the link: for an already initialized network that has any BatchNorm layers, you can do:
sync_bn_network = torch.nn.SyncBatchNorm.convert_sync_batchnorm(network, process_group)
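For context, a sketch of how that conversion is typically combined with DDP (MyModel and the process-group setup are placeholders):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

network = MyModel().cuda(local_rank)  # MyModel is a placeholder

# Replace every BatchNorm*d module with SyncBatchNorm so the statistics are
# computed over the global batch instead of each rank's local shard.
network = torch.nn.SyncBatchNorm.convert_sync_batchnorm(network)

ddp_model = DDP(network, device_ids=[local_rank])
```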


Thanks a lot. Since this new layer (added in PyTorch 1.1) is quite slow when applied to multi-node distributed training, I do not plan to use it. Training can slow down to 2x longer when I train on 4 nodes with 4 GPUs each; multi-node GPU communication is the bottleneck in my case.

Oh! That's interesting. I'm currently using the SyncBatchNorm layer in single-node 8-GPU training.

Did you try to implement your own BN synchronization code? I see that you had an idea here.

Did you implement synchronization? Also, were you successful in getting your accuracy up?

Also, did you test the above? If you did, what were the results?