Hello,

I have a generative model that is learning a latent variable using backprop as part of the generation process. For each sample there is a custom latent variable and they have nothing to do with each other. The only reason I use batches with more than one sample is to speed things up.

I want to find out if the performance can be increased by using a learning rate scheduler. However, since the samples are independent, I need one scheduler for each sample. In order to do this, I probably need one parameter group for each sample. Let’s say I have 512 samples in my batch. I have tried the following:

```
param_groups = [{"params": [latent[i]]} for i in range(latent.shape[0])]
opt = optim.Adam(params=param_groups)
```

When running this I get

```
ValueError: can't optimize a non-leaf Tensor
```

which makes sense, because the original tensor is `latent` and I am slicing it, so the slice is not a leaf anymore. So I tried this instead:

```
param_groups = [{"params": [latent[i].detach().requires_grad_(True)]} for i in range(latent.shape[0])]
opt = optim.Adam(params=param_groups)
```

This solves the non-leaf issue, but (of course) the detached copies are cut off from the original latent tensor, so when I run the following in my backprop loop, the gradient never reaches the parameters the optimizer actually sees:

```
generated = self.do_some_fancy_stuff(latent)
loss = my_loss(generated, groundtruth)
loss.backward()
opt.step()
```

I could pass each sample separately through the model and backprop the loss to the slice of the latent tensor, but that is much slower.
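To be concrete, what I mean by the slow per-sample variant is roughly this (the model and loss below are placeholders standing in for my real `do_some_fancy_stuff` and `my_loss`):

```python
import torch

torch.manual_seed(0)
B, D = 4, 8
do_fancy = lambda z: z * 2.0                        # placeholder model
my_loss = lambda gen, gt: ((gen - gt) ** 2).mean()  # placeholder loss
groundtruth = torch.randn(B, D)

# One leaf tensor and one optimizer (so one learning rate) per sample.
latents = [torch.randn(D, requires_grad=True) for _ in range(B)]
opts = [torch.optim.Adam([z], lr=1e-2) for z in latents]

# Each sample is passed through the model on its own -> much slower.
for i in range(B):
    opts[i].zero_grad()
    loss = my_loss(do_fancy(latents[i]), groundtruth[i])
    loss.backward()
    opts[i].step()
```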

So my question is: is there some way to make the slice `latent[i]` share its gradient with the corresponding row of `latent`?
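The closest workaround I can think of is to keep one leaf tensor per sample and `torch.stack` them inside the forward pass: every leaf can then get its own param group, while the batch still goes through the model in a single pass. A rough sketch (again with placeholder model and loss):

```python
import torch

torch.manual_seed(0)
B, D = 4, 8
do_fancy = lambda z: z * 2.0                        # placeholder model
my_loss = lambda gen, gt: ((gen - gt) ** 2).mean()  # placeholder loss
groundtruth = torch.randn(B, D)

# One leaf per sample, each in its own param group.
leaves = [torch.randn(D, requires_grad=True) for _ in range(B)]
opt = torch.optim.Adam([{"params": [z]} for z in leaves])

opt.zero_grad()
latent = torch.stack(leaves)   # (B, D) batch; non-leaf, but stays in the graph
loss = my_loss(do_fancy(latent), groundtruth)
loss.backward()                # gradients land on each individual leaf
opt.step()
```

I am not sure whether stacking on every iteration adds noticeable overhead, though.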

I have also tried this:

```
loss.backward()
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = latent.grad[i]
```

The idea is to calculate the gradient on the whole batch and then set the gradient of the slices manually, but then I get this when calling `opt.zero_grad()`:

```
RuntimeError: Can't detach views in-place. Use detach() instead
```

So I tried to replace `zero_grad` with manually setting `grad = None`:

```
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = None  # manual zero_grad
latent.grad = None  # manual zero_grad
loss.backward()
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = latent.grad[i]
```

The code runs without errors, but the results are very bad. There still seems to be an issue with the gradients, but I don't know what is going wrong.

Or is there any other way to have a custom learning rate for each sample in a batch?
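For example, I wondered whether, at least for plain SGD, per-sample learning rates could be emulated by scaling `latent.grad` row-wise before a manual update. A sketch with a placeholder loss (note that this equivalence would not hold for Adam, because of its per-parameter running statistics):

```python
import torch

torch.manual_seed(0)
B, D = 4, 8
latent = torch.randn(B, D, requires_grad=True)
groundtruth = torch.randn(B, D)
lrs = torch.linspace(1e-3, 1e-1, B)  # hypothetical per-sample learning rates

loss = ((latent * 2.0 - groundtruth) ** 2).mean()  # placeholder loss
loss.backward()

with torch.no_grad():
    # Manual SGD step: each row of the gradient is scaled by its
    # own learning rate before being applied.
    latent -= lrs.view(-1, 1) * latent.grad
latent.grad = None  # manual zero_grad
```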