I am trying to train with a custom parameter server (PS). It checks all the boxes for setting weights and updating gradients, but for some reason accuracy on my data does not improve.
I’m using RedisAI as the database that hosts my training/test data, the global weights, and every worker's weights and gradients. I know the common opinion is that an external database is a performance hit, but this is something I want to do for my own experience.
The PS holds the global model. Each worker copies the global weights into its local model, trains on them, and pushes its gradients back to the PS, which uses them to update the global model.
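To make that round trip concrete, here is a minimal sketch of the workflow with no PyTorch or RedisAI involved. Everything in it is an assumption for illustration: `store` is a plain dict standing in for the database, the "model" is a two-parameter line fit, and the helper names (`ps_publish`, `worker_step`, `ps_apply`) are hypothetical. Only the `<name>_data` / `<name>_grad` key layout mirrors the snippets in this post.

```python
# Toy parameter-server round trip. `store` is a plain dict standing in
# for the database; all helper names here are hypothetical.
store = {}

def ps_publish(weights):
    """PS side: write every global weight under '<name>_data'."""
    for name, value in weights.items():
        store[f"{name}_data"] = value

def worker_step(x, y):
    """Worker side: pull global weights, compute grads, push '<name>_grad'."""
    w, b = store["w_data"], store["b_data"]
    err = w * x + b - y            # residual of the prediction w*x + b
    store["w_grad"] = 2 * err * x  # d/dw of err**2
    store["b_grad"] = 2 * err      # d/db of err**2

def ps_apply(weights, lr):
    """PS side: plain SGD step from the pushed grads, then republish."""
    for name in weights:
        weights[name] -= lr * store[f"{name}_grad"]
    ps_publish(weights)

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # samples of y = 2x + 1

def loss(weights):
    return sum((weights["w"] * x + weights["b"] - y) ** 2 for x, y in data)

global_weights = {"w": 0.0, "b": 0.0}
ps_publish(global_weights)
initial_loss = loss(global_weights)

for _ in range(200):
    for x, y in data:
        worker_step(x, y)
        ps_apply(global_weights, lr=0.05)

final_loss = loss(global_weights)
print(initial_loss, final_loss)
```

In this sketch the PS never runs a forward or backward pass of its own. It only applies whatever gradients the worker pushed, which is the behavior I expect from my real setup.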
PS:
Setting weights:
for k, v in model.state_dict().items():
    # store each tensor under '<name>_data', the key the workers read
    conn.tensorset(f'{k}_data', v.cpu().numpy())
Getting gradients:
msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = conn.tensorget(f'{name}_grad')
model.load_state_dict(msd)
optimizer.step()
optimizer.zero_grad()
Worker:
Getting weights:
lmsd = model.state_dict()
for name, v in lmsd.items():
    # overwrite each local tensor in place with the global value
    lmsd[name].data.copy_(conn.tensorget(f'{name}_data'))
model.load_state_dict(lmsd)
Setting grads:
for name, param in model.named_parameters():
    conn.tensorset(f'{name}_grad', param.grad.data.cpu().numpy())
I honestly can’t figure out why my global model won’t improve.
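One check that might narrow this down is whether a gradient assigned through a `state_dict()` entry ever reaches the parameters the optimizer holds. The sketch below is only a diagnostic under stated assumptions: a toy `nn.Linear` replaces my real model, `torch.ones_like(param)` replaces the gradient fetched from the database, and plain SGD replaces my optimizer. By default `state_dict()` returns detached tensors, so the `.grad` set on them may not be the `.grad` that `optimizer.step()` reads.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)  # toy stand-in for the global model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Snapshot the weights so any update is easy to detect.
before = {n: p.detach().clone() for n, p in model.named_parameters()}

# Same pattern as the PS snippet, with a fake gradient of ones
# in place of conn.tensorget(...).
msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = torch.ones_like(param)
model.load_state_dict(msd)

# Did the gradients land on the parameters the optimizer sees?
grads_missing = all(p.grad is None for p in model.parameters())

optimizer.step()
weights_unchanged = all(
    torch.equal(before[n], p) for n, p in model.named_parameters()
)
print("param.grad is None everywhere:", grads_missing)
print("weights unchanged after step():", weights_unchanged)
```

If both flags come out `True`, the assigned gradients never reached the optimizer's parameters, and `optimizer.step()` had nothing to apply.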
I have a workaround in which the global model makes a single backward pass after setting the gradients from the worker model (as if to accumulate them). That seems to work, but I can’t fully understand how or why.
msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = conn.tensorget(f'{name}_grad')
model.load_state_dict(msd)
# build a single batch input
out = model(input)
criterion(out, label).backward()
optimizer.step()
optimizer.zero_grad()
Does the grad_fn need to be retained for the optimizer to update the weights? I didn’t think so; I assumed grad_fn only matters while the gradients are being computed during the backward pass.
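A small experiment for the grad_fn question, as a sketch rather than my actual PS code: assign gradients directly to `param.grad` from NumPy arrays, with no forward or backward pass at all, and see whether `optimizer.step()` still moves the weights. The assumptions are a toy `nn.Linear`, plain SGD, and a hypothetical `fake_store` dict standing in for `conn.tensorget`.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)  # toy stand-in for the global model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
before = {n: p.detach().clone() for n, p in model.named_parameters()}

# Hypothetical stand-in for the database: one array of ones per parameter.
fake_store = {
    f"{n}_grad": np.ones(tuple(p.shape), dtype=np.float32)
    for n, p in model.named_parameters()
}

# Assign straight to the parameters; no forward or backward pass is run.
for name, param in model.named_parameters():
    grad = torch.from_numpy(fake_store[f"{name}_grad"])
    param.grad = grad.to(param.device)

# These gradients were not produced by autograd, so they carry no grad_fn.
no_grad_fn = all(p.grad.grad_fn is None for p in model.parameters())

optimizer.step()
optimizer.zero_grad()

# With lr=0.1 and gradients of ones, each weight should move by about 0.1.
max_change = max(
    (before[n] - p.detach()).abs().max().item()
    for n, p in model.named_parameters()
)
print("no grad_fn on assigned grads:", no_grad_fn)
print("largest weight change:", max_change)
```

If the weights move here, the optimizer only needs `param.grad` to be populated on the parameters it was constructed with, and no grad_fn has to be retained.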
Making a backward pass in the PS seems counterintuitive given the purpose and general workflow of a PS.
Hopefully somebody has some insight as to why the PS is not improving without a backward pass.