Custom Parameter Server (PS) not improving

I am trying to train over a custom parameter server. It checks all the boxes for setting weights and updating gradients, but for some reason it won't improve accuracy on my data.

I'm using RedisAI as my database to host all my training/test data as well as the global and per-worker weights/gradients. I know the consensus is that using an external database is a performance hit, but this is something I want to do for my own experience.
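
For reference, all of the snippets below use a conn handle to RedisAI. Roughly, it is created like this (a minimal sketch with the redisai-py client; host/port are placeholders for my local setup):

import torch
from redisai import Client

# Placeholder connection details for a local RedisAI instance.
conn = Client(host='localhost', port=6379)

# Round trip: tensorset stores a numpy array, tensorget returns one.
weights = torch.randn(3, 3)
conn.tensorset('example_key', weights.cpu().numpy())
restored = torch.from_numpy(conn.tensorget('example_key'))
assert torch.allclose(weights, restored)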

The PS holds the global model. The workers read it into their local models, update them, and then push gradients back to the PS to update the global model.

PS:
Setting weights:

for k,v in model.state_dict().items():
    conn.tensorset(k,v.cpu().numpy())

Getting gradients:

msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = conn.tensorget(f'{name}_grad')
model.load_state_dict(msd)
optimizer.step()
optimizer.zero_grad()

Worker:
Getting weights:

lmsd = model.state_dict()
for k, v in model.state_dict().items():
    lmsd[k].data.copy_(conn.tensorget(f'{k}_data'))
model.load_state_dict(lmsd)

Setting grads:

for name, param in model.named_parameters():
    conn.tensorset(f'{name}_grad', param.grad.data.cpu().numpy())
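
Put together, one iteration on the worker side looks roughly like this (a sketch only; train_loader and criterion are placeholders, and the key names follow the convention above):

import torch

for inputs, labels in train_loader:
    # Pull the latest global weights from the PS.
    lmsd = {k: torch.from_numpy(conn.tensorget(f'{k}_data'))
            for k in model.state_dict().keys()}
    model.load_state_dict(lmsd)

    # Local forward/backward pass.
    model.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()

    # Push the resulting gradients back to the PS.
    for name, param in model.named_parameters():
        conn.tensorset(f'{name}_grad', param.grad.cpu().numpy())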

I can’t honestly figure out why my global model won’t improve.

I have a workaround that involves the global model making a single backward pass after setting the gradients from the worker model (as if to accumulate them), and that seems to be working; but I can't fully understand how or why.

msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = conn.tensorget(f'{name}_grad')
model.load_state_dict(msd)
# build a single batch input
out = model(input)
criterion(out, label).backward()
optimizer.step()
optimizer.zero_grad()

Does the grad_fn need to be retained for the optimizer to update the weights? I didn't think it did; I thought it only mattered at the gradient-setting level during the backward pass.

Making a backward pass in the PS seems counterintuitive to the purpose and general workflow of the PS.

Hopefully somebody has some insight as to why the PS is not improving without a backward pass.

The first evident error in your code is:

msd = model.state_dict()
for name, param in model.named_parameters():
    msd[name].grad = conn.tensorget(f'{name}_grad')
model.load_state_dict(msd)

Since load_state_dict will not load gradients, but only the registered nn.Parameter()s and buffers of the model, you will have to iterate over your model's parameters directly and assign the gradients there, like:

                ...
                # Assign gradients to the managed model and
                # perform optimization. (t is torch)
                if self.model is not None and self.optimizer is not None:
                    self.optimizer.zero_grad()
                    with t.no_grad():
                        for k, v in self.model.named_parameters():
                            v.grad = grad_dict[k].to(v.device)
                    self.optimizer.step()

Therefore, the “solution” you discovered is basically performing an optimization step on your “PS” end, but your pushed gradients are not utilized.
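
To answer the grad_fn question directly: no backward pass (and no grad_fn) is needed for the update; the optimizer only reads the .grad attribute of each parameter. A minimal standalone sketch:

import torch

# Toy model; no forward or backward pass is run anywhere below.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
before = model.weight.detach().clone()

# Assign gradients by hand, as if they had just arrived from a worker.
with torch.no_grad():
    for p in model.parameters():
        p.grad = torch.ones_like(p)

optimizer.step()       # uses the assigned .grad to update the weights
optimizer.zero_grad()

assert not torch.equal(before, model.weight.detach())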


Thanks, @iffiX
That definitely helps leverage the workers' gradients. The model finds solutions a LOT faster now that it is actually able to explore different possibilities.

I tried it without the backward step at the PS and it causes the predictions to become nan. However, if I keep the backward step in and now actually accumulate the gradients from the worker with those from the single batch in the PS, the model is able to find solutions.

Once the gradients are set directly on the global model's parameters, I should technically be able to perform an optimizer step and update the global weights.
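
Concretely, the accumulation I mean on the PS side looks something like this (a sketch, not my exact code; key names follow the convention above):

import torch

# Add the worker's gradients to whatever is already in .grad
# (e.g. from the PS's own single-batch backward pass).
with torch.no_grad():
    for name, param in model.named_parameters():
        worker_grad = torch.from_numpy(conn.tensorget(f'{name}_grad')).to(param.device)
        if param.grad is None:
            param.grad = worker_grad.clone()
        else:
            param.grad += worker_grad

optimizer.step()
optimizer.zero_grad()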

Any idea why the predictions result in nan when there isn’t a backward step in the PS?

From my observation, nan can be caused by a lot of things: an inappropriately designed reward, the model itself, invalid input data, etc.

I would suggest printing the .grad attribute of every parameter, along with its source (worker rank); normally, just before the nan occurs, you will see several ridiculously large gradients like 1e8 or 1e11.
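
Something along these lines, for example (a sketch; worker_rank stands for whatever identifier you attach to the pushed gradients):

import torch

# Print a quick gradient summary before stepping the optimizer.
for name, param in model.named_parameters():
    if param.grad is not None:
        g = param.grad
        print(f'worker {worker_rank} | {name}: '
              f'max |grad| = {g.abs().max().item():.3e}, '
              f'any nan = {bool(torch.isnan(g).any())}')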


I think I fixed that issue. I didn’t seem to be passing the right weights from the PS to the workers.

for k, v in self.model.state_dict().items():
    self.conn.tensorset(k, v.cpu().numpy())

Which is what I wrote above, but I actually had v.data.cpu().numpy() in my code.

Now my problem is that my model gives empty predictions.

I am distributing a UNet for image segmentation, and I'm using gradient accumulation as a tradeoff for the framework's performance with the RedisAI db. Even though the weights and gradients are now being passed properly, my model's loss is decreasing… but so is its accuracy (IoU), and it ends up producing empty predictions.

It might be the gradient accumulation and batch normalization or it could be the ReLU activations on the decoding half of the model.
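
If the batch-norm interaction is the problem, one common check is to freeze the BatchNorm running statistics while accumulating gradients, roughly like this (a sketch; whether this is actually the culprit here is still an open question):

import torch.nn as nn

def freeze_batchnorm_stats(model):
    # Keep BatchNorm layers in eval mode so their running statistics are
    # not updated by the small accumulated batches; their weights still train.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

model.train()                  # everything else stays in training mode
freeze_batchnorm_stats(model)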

Unless someone knows how a UNet's nuances play out over a distributed framework?

Hmmm, I have not tested UNet in a distributed scenario; that's the realm of real scientific studies :blush:
Maybe the computer vision subreddit is a better place? Or try asking a new question in the computer vision category of the PyTorch forum?


https://discuss.pytorch.org/t/distributed-unet-over-parameter-server-output-empty/89082

Link to that question, for anyone who runs into this in the future.