Optimising a huge number of parameters

I am trying to optimise a large number of parameters with gradient descent. These parameters are not computed but stored in memory, and they have the same size as the dataset.

I was wondering if there was a way to batch these parameters along with the dataset? The trouble is that they have to be changed via gradient descent and then rewritten into memory.

How many parameters are you talking about? If it’s a memory constraint and your gradients are purely calculated via loss.backward(), then you can use checkpointing (see documentation here).

The parameters are stored in a single tensor of the following shape:

[size of dataset, number of layers, hidden dimension, num_parameters]

where “num_parameters” is a hyperparameter that is chosen before training.

Will I have to checkpoint after each backward pass? The gradients are indeed purely calculated via loss.backward().

There’s a tutorial for using torch.utils.checkpoint here. Try to follow that, and see if it’s applicable to your use case!
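For reference, a minimal sketch of what activation checkpointing looks like in practice. The two-block model and the tensor sizes are made up for illustration. Note that checkpointing wraps part of the forward pass so that its activations are recomputed during backward; it is not something you call after each backward pass.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Hypothetical two-block model, used only for illustration.
block1 = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
block2 = nn.Linear(64, 1)

x = torch.randn(8, 16)

# Activations inside block1 are not kept; they are recomputed during backward.
h = checkpoint(block1, x, use_reentrant=False)
loss = block2(h).mean()
loss.backward()

# Gradients still reach the parameters of the checkpointed block.
grad_ok = all(p.grad is not None for p in block1.parameters())
```

This trades extra compute for lower activation memory. It does not reduce the memory taken by the parameters themselves.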


Unfortunately I don’t think checkpointing will work for my use case. The reason for this is that my parameters are not computed in the forward pass but are stored in memory and updated in memory. For clarity, they are initialised something like this:

       # th is torch; each size below is an integer fixed before training
       params = nn.Parameter(th.zeros(
            dataset_size, num_layers, hidden_dim, num_parameters
       ))

On each step, only the part of the tensor params that corresponds to the indices of my batched data is updated.

I was wondering if there was a way to batch this tensor in memory, as is done with my data. The difficulty is that I have to update the tensor above and then write it back, which is never needed for the data itself.
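One possible approach, sketched here under assumptions rather than taken from the thread: keep the full table as a plain tensor outside autograd, copy out only the rows for the current batch, differentiate with respect to that copy, and write the updated rows back. The sizes, the learning rate, the train_step helper, and the squared-error loss are all placeholders; a real model would use batch_params in its forward pass. The sketch also assumes the indices within a batch are unique.

```python
import torch

# Hypothetical sizes and learning rate, chosen only for illustration.
dataset_size, num_layers, hidden_dim, num_parameters = 1000, 2, 8, 4
lr = 0.1

# The full table is a plain tensor in memory; autograd never sees all of it.
params = torch.zeros(dataset_size, num_layers, hidden_dim, num_parameters)

def train_step(idx, target):
    # Copy only this batch's rows and make the copy a leaf that autograd tracks.
    batch_params = params[idx].clone().requires_grad_(True)
    # Placeholder loss standing in for the real model's forward pass.
    loss = ((batch_params - target) ** 2).mean()
    loss.backward()
    # Plain SGD step on the slice, then write the result back into the table.
    with torch.no_grad():
        params[idx] = batch_params - lr * batch_params.grad
    return loss.item()

idx = torch.tensor([3, 17, 42])  # indices delivered alongside the batch
target = torch.ones(len(idx), num_layers, hidden_dim, num_parameters)
first = train_step(idx, target)
second = train_step(idx, target)
```

The dataset would need to return each sample's index so that idx is available per batch. If an optimiser with state (momentum, Adam) is needed, that state has to be kept per row as well; nn.Embedding with sparse=True together with torch.optim.SparseAdam may be worth a look for that case.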