Hello!
I implemented an RPC-based parameter server framework following this link, but I ran into a problem.
If the network being trained contains no batch norm layers, both training and testing give the expected results. However, if the network does contain batch norm layers (e.g. a ResNet), training looks normal, but during the testing phase, i.e. after `model.eval()`, the test accuracy is always 0.1. My test code is as follows:
```python
def get_accuracy(test_loader, model):
    model.eval()
    correct_sum = 0
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    with torch.no_grad():
        for i, (data, target) in enumerate(test_loader):
            data, target = data.to(device), target.to(device)
            out = model(data)
            pred = out.argmax(dim=1, keepdim=True)
            correct = pred.eq(target.view_as(pred)).sum().item()
            correct_sum += correct
    print(f"Accuracy {correct_sum / len(test_loader.dataset)}")
```
After hitting this problem I did some reading. The cause might be the `running_mean` and `running_var` of the batch norm layers, but I don't know what to do about them. I am very confused; can you give me some advice? Thank you!
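For context on what I checked so far: `running_mean` and `running_var` are registered as buffers rather than parameters, so if the parameter server only exchanges `model.parameters()`, the running statistics would never be synced. This is a minimal standalone check (not my RPC code) that shows the buffers are missing from `parameters()` but present in `state_dict()`:

```python
import torch.nn as nn

# BatchNorm's running statistics are buffers, not parameters.
bn = nn.BatchNorm2d(3)

param_names = {name for name, _ in bn.named_parameters()}
buffer_names = {name for name, _ in bn.named_buffers()}

print(param_names)   # only 'weight' and 'bias'
print(buffer_names)  # 'running_mean', 'running_var', 'num_batches_tracked'

# state_dict() contains both parameters and buffers, so syncing the
# state_dict (or named_buffers()) would carry the running stats along.
print(buffer_names <= set(bn.state_dict().keys()))
```

Is syncing the buffers alongside the parameters the right approach here, or is there a more standard fix?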