Custom LR scheduler per sample instead of one for the whole dataset, how to couple gradients?

Hello,

I have a generative model that learns a latent variable via backprop as part of the generation process. Each sample has its own latent variable, and the latents have nothing to do with each other; the only reason I use batches with more than one sample is to speed things up.

I want to find out whether performance can be improved by using a learning rate scheduler. However, since the samples are independent, I need one scheduler per sample, and for that I probably need one parameter group per sample. Let’s say I have 512 samples in my batch. I have tried the following:

param_groups = [{"params": [latent[i]]} for i in range(latent.shape[0])]
opt = optim.Adam(params=param_groups)

When running this I get

ValueError: can't optimize a non-leaf Tensor

which makes sense, because the original tensor is latent and I am slicing it, so the slice is no longer a leaf. So I tried this instead:

param_groups = [{"params": [latent[i].detach().requires_grad_(True)]} for i in range(latent.shape[0])]
opt = optim.Adam(params=param_groups)

This solves the non-leaf issue, but (of course) the detached slices are new leaves that are cut off from the original latent tensor. So when I run the following in my backprop loop, the gradient ends up on latent and never reaches the parameters stored in the optimizer:

generated = self.do_some_fancy_stuff(latent)
loss = my_loss(generated, groundtruth)
loss.backward()
opt.step()
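
A tiny standalone check (made-up shape, with a squared sum standing in for the model and loss) shows where the gradient actually ends up:

import torch
from torch import optim

latent = torch.randn(4, 8, requires_grad=True)
param_groups = [{"params": [latent[i].detach().requires_grad_(True)]} for i in range(latent.shape[0])]
opt = optim.Adam(params=param_groups)

loss = (latent ** 2).sum()  # stand-in for do_some_fancy_stuff + my_loss
loss.backward()

print(latent.grad is None)                            # False: latent receives the gradient
print(opt.param_groups[0]["params"][0].grad is None)  # True: the detached copies never do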

I could pass each sample separately through the model and backprop the loss to the corresponding slice of the latent tensor, but that is much slower.

So my question is: Is there a way to make the slice latent[i] share its gradient with the corresponding vector in latent?

I have also tried this:

loss.backward()
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = latent.grad[i]

The idea is to compute the gradient on the whole batch and then set the gradients of the slices manually, but then I get the following when calling opt.zero_grad():

RuntimeError: Can't detach views in-place. Use detach() instead

So I tried replacing zero_grad with manually setting grad = None (presumably the error comes from latent.grad[i] being a view into latent.grad, which the in-place detach inside zero_grad cannot handle):

for i in range(latent.shape[0]):      
    opt.param_groups[i]["params"][0].grad = None # manual zero_grad
latent.grad = None # manual zero_grad

loss.backward()

for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = latent.grad[i]

The code runs without errors, but the results are very bad. There still seems to be an issue with the gradients, but I don’t know what is going wrong.

Or is there any other way to have a custom learning rate for each sample in a batch?

I think I got it. It does not seem like a very clean solution, but it works:

param_groups = [{"params": [latent[i].detach().requires_grad_(True)]} for i in range(latent.shape[0])] 
opt = optim.Adam(params=param_groups)

# in my loop
generated = self.do_some_fancy_stuff(latent)
loss = my_loss(generated, groundtruth)

# Do zero_grad manually
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = None
latent.grad = None if latent.grad is None else latent.grad.detach().zero_()

loss.backward()

# Copy the gradients from the slices back to the original tensor
for i in range(latent.shape[0]):
    opt.param_groups[i]["params"][0].grad = latent.grad[i]

# Apply the gradients to the latent vectors
opt.step()

# Copy the new latent vectors from the slices to the original tensor so that it can be used in the next epoch
for i in range(latent.shape[0]):
    latent.data[i] = opt.param_groups[i]["params"][0].data

So basically I have an optimizer with one parameter group per sample. In my backprop loop I run the model on the original tensor (not the slices), compute the loss and the gradient, copy the gradient to the slices, run the optimizer, and then copy the new values from the slices back into the original tensor.

This seems to work, but is it the best way to do it? The performance is not ideal because of all the copying. It would be ideal if there were a way to tell PyTorch that the slices and the original tensor share the same memory and gradients. Is this possible?
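
A cleaner variant I might try instead (just a sketch with made-up sizes, reusing the placeholder names from above): keep one leaf tensor per sample and torch.stack them in the forward pass. The stack still copies once per iteration, but autograd routes the gradients back to each per-sample leaf on its own, so all the manual grad and value copying would go away:

import torch
from torch import optim

num_samples, latent_dim = 512, 128  # made-up sizes
latents = [torch.randn(latent_dim, requires_grad=True) for _ in range(num_samples)]
opt = optim.Adam([{"params": [p]} for p in latents])  # one group per sample, as before

# in the loop
latent = torch.stack(latents)                 # (512, latent_dim), rebuilt from the leaves
generated = self.do_some_fancy_stuff(latent)  # backward reaches every latents[i] through the stack
loss = my_loss(generated, groundtruth)
opt.zero_grad()
loss.backward()
opt.step()                                    # updates the leaves directly, nothing to copy back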

Of course, the LR scheduler still needs to be added, but that should not be a big problem.

EDIT:
OK, it seems the LR scheduler updates every parameter group with the same schedule, so I would need a separate optimizer and scheduler for each sample. This is turning into a really messy solution. Am I on the wrong track?
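
One workaround I am considering instead of one scheduler per sample (again just a sketch, with a hypothetical per-sample decay rule): skip torch.optim.lr_scheduler entirely and adjust each group's "lr" entry by hand, since the optimizer reads that value fresh on every step():

import torch

# hypothetical per-sample decay factors, e.g. decay faster for samples that have already converged
decay = torch.full((len(opt.param_groups),), 0.99)

# at the end of every iteration, after opt.step()
for i, group in enumerate(opt.param_groups):
    group["lr"] *= decay[i].item()  # Adam picks up the new lr on the next step()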