Hi, I have a question about NVIDIA apex.
As I understand it, NVIDIA apex runs one process per GPU,
so each process is identified by the local_rank variable in my code.
I want to save the best accuracy from each process, and I coded it as shown below.
When using 2 GPUs:
for epochs in range(0, args.epoch):
    train()
    test()
    ...
    save_best()
def save_best():
    # best_acc is reassigned below, so it must be declared global
    global best_acc
    # GPU 0 (process with local_rank 0)
    if args.local_rank == 0:
        is_best = test_acc > best_acc
        best_acc = max(test_acc, best_acc)
        if is_best:
            torch.save(...)
    # GPU 1 (process with local_rank 1)
    if args.local_rank == 1:
        is_best = test_acc > best_acc
        best_acc = max(test_acc, best_acc)
        if is_best:
            torch.save(...)
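Since both branches above do the same thing, the per-process logic could also be written once and parameterized by rank. The following is only a minimal sketch of that idea: `BestCheckpoint`, `save_fn`, and the `best_rank{N}.pth` file name are hypothetical names chosen for illustration, not part of apex or of my actual script. A stand-in save function is used here so the snippet runs without a GPU.

```python
from typing import Any, Callable


class BestCheckpoint:
    """Tracks the best accuracy seen by one process and saves on improvement."""

    def __init__(self, rank: int, save_fn: Callable[[Any, str], None]):
        self.rank = rank
        self.save_fn = save_fn  # hypothetical hook; e.g. torch.save in real training code
        self.best_acc = float("-inf")

    def update(self, test_acc: float, state: Any) -> bool:
        """Save `state` if `test_acc` beats the best so far; return whether it did."""
        is_best = test_acc > self.best_acc
        if is_best:
            self.best_acc = test_acc
            # One file per rank, so processes never overwrite each other's file.
            self.save_fn(state, f"best_rank{self.rank}.pth")
        return is_best


# Usage with a stand-in save function that only records the calls.
saved = []
tracker = BestCheckpoint(rank=0, save_fn=lambda state, path: saved.append((path, state)))
tracker.update(19.906, {"epoch": 1})
tracker.update(18.500, {"epoch": 2})  # not an improvement, so nothing is saved
```

In a real run, `save_fn` would presumably be `torch.save` and `rank` would be `args.local_rank`; both are assumptions about how this would be wired into the loop above.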
After 1 epoch, I can check the accuracy reported by each process:
GPU 0’s accuracy is 19.906, and its weights are saved to the 0th weight file.
GPU 1’s accuracy is 19.269, and its weights are saved to the 1st weight file.
But when I load each weight file into the network, the test accuracy does not match either result:
I get 19.572 (0th file) and 19.561 (1st file).
Surprisingly, when I train with 1 GPU, this does not happen (the test accuracy during training equals the accuracy I get after loading the weight file).
I can’t understand why this happens.
Can anybody help?
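One thing that may be worth checking, although it is only a guess because the data loading code is not shown above: if the test set is split across processes, each process reports accuracy on its own shard only, and the accuracy on the full test set would be the size-weighted average of the per-shard values. The sketch below just does that arithmetic. `combine_accuracy` is a hypothetical helper, and equal shard sizes are an assumption.

```python
from typing import Sequence


def combine_accuracy(accs: Sequence[float], sizes: Sequence[int]) -> float:
    """Size-weighted average of per-shard accuracies (all in percent)."""
    if len(accs) != len(sizes) or not accs:
        raise ValueError("accs and sizes must be non-empty and the same length")
    total = sum(sizes)
    return sum(a * n for a, n in zip(accs, sizes)) / total


# Per-process accuracies from the post; equal shard sizes are assumed.
combined = combine_accuracy([19.906, 19.269], [1, 1])
print(f"{combined:.4f}")
```

If the combined value lands near the accuracies measured after loading the files, that would point toward per-shard evaluation rather than a problem with saving or loading. It would not by itself explain the small difference between the two files.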