Efficient way of handling the entire gradient of a network

When doing SGD, we can split the update rule across the individual parameter tensors in a for loop. So for every layer of a feedforward neural network, we would update its weights and biases in turn.
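For concreteness, plain SGD written as such a per-tensor loop might look like this (just a sketch, assuming a model and a learning rate lr, no momentum):

    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad  # update each parameter tensor separately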

However, I want to implement a different version of SGD where only the k parameters corresponding to the largest gradient entries are actually updated. The crucial point: these should be the k largest gradient entries across all model parameters. For k=1 this means we only update the single parameter in the whole network corresponding to the largest gradient entry, not the largest entry in each layer.

How would I find the maximal elements of the entire gradient vector efficiently?
Thanks a lot!


Hi,

You will most likely have to do a for loop over all the parameters to find the one with the largest value.

If you actually need a topk, it might be simplest to concatenate all the grads into a single Tensor and then call topk on that Tensor.
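Something along these lines (a rough sketch, assuming gradients are already populated and k is your budget):

    flat = torch.cat([p.grad.view(-1) for p in model.parameters() if p.grad is not None])
    values, indices = torch.topk(flat.abs(), k)  # k largest entries by magnitude over the whole model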

Thanks for the fast reply. That's what I am doing now, but I thought there might be a more efficient way. I tried to use the torch.nn.utils.parameters_to_vector function, but it only works with Parameters. Given my solution below, can this be done more efficiently? My idea was to create the generator once, and then in every per-layer step of SGD I yield the relevant slice of the mask to multiply the gradient with. Not sure if this is a good approach; it is quite slow compared to plain SGD. I call the function with the following list: param_list = [p for group in self.param_groups for p in group['params'] if p.grad is not None].

    def get_mask_generator(self, param_list):
        """Generator for the topk mask, built once per optimizer step"""

        # Flatten all gradient magnitudes into a single vector and find the
        # indices of the Q largest entries across the whole model
        grad_vector = torch.cat([torch.abs(p.grad).view(-1) for p in param_list])
        grad_vector_shape = grad_vector.shape
        device = grad_vector.device
        top_indices = torch.topk(grad_vector, k=self.Q).indices
        del grad_vector
        mask_vector = torch.zeros(grad_vector_shape, device=device)
        mask_vector[top_indices] = 1

        # Define the generator: yield one mask slice per parameter tensor,
        # reshaped to that tensor's shape (the code above runs only once)
        for p in param_list:
            numEl = p.numel()
            partial_mask = mask_vector[:numEl]
            mask_vector = mask_vector[numEl:]
            yield partial_mask.view(p.shape)

Thanks for the help!

Well, parameters_to_vector is doing the same thing as your cat operation, so that will be just as efficient.
I don’t think you can do much better here; manual bookkeeping of the values will most likely end up being more expensive than these few large ops.

I am not sure why you need this to be a generator, though, as you could directly update all the p.grad in place from this function, no?

Well, it allowed me to modify SGD by adding 3 lines of code, instead of re-coding everything for vector gradient updates (e.g. momentum, Nesterov, etc.).
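For context, this is roughly how the generator gets consumed inside the modified step() (a sketch only; self.Q and the exact loop structure are assumptions based on my snippet above, and momentum/Nesterov handling is omitted):

    params = [p for group in self.param_groups
              for p in group['params'] if p.grad is not None]
    mask_gen = self.get_mask_generator(params)  # built once per step
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            masked_grad = p.grad * next(mask_gen)  # keep only the top-k entries
            p.data.add_(masked_grad, alpha=-group['lr'])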

I’m not sure I follow; couldn’t your code be:

    loss = ...  # your code
    opt.zero_grad()
    loss.backward()
    mask_gradients(model)
    opt.step()

So that you don’t have to change the optimizer at all?
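For reference, mask_gradients could be something like this minimal sketch (the k argument for the budget is an assumption here; it reuses the cat/topk idea from above and zeroes all other gradient entries in place):

    import torch

    def mask_gradients(model, k):
        """Zero out all gradient entries except the k largest in magnitude across the whole model."""
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        flat = torch.cat([g.abs().view(-1) for g in grads])
        mask = torch.zeros_like(flat)
        mask[torch.topk(flat, k=k).indices] = 1
        offset = 0
        for g in grads:
            n = g.numel()
            g.mul_(mask[offset:offset + n].view_as(g))  # in-place masking of p.grad
            offset += n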


Actually you’re pretty damn right. Thanks for the tip.