I don't want to change the model structure. For every parameter c, I want it to be c = a*b, and I want backprop to be able to flow back to a and b rather than stopping at c.
I know some laborious (and perhaps inefficient) ways to do this, for example rewriting the model architecture or manually handling a and b in each backward pass, but is there a nicer way to do this?
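For concreteness, this is the behaviour I'm after in the smallest possible case (just a sketch; a and b are placeholder tensors):

import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
c = a * b              # c is derived from a and b, not a leaf parameter
loss = c.sum()
loss.backward()
print(a.grad, b.grad)  # gradients reach a and b instead of stopping at c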
You can just use the nn.Parameter class found here.
For example:
import torch
import torch.nn as nn

class CustomLinearLayer(nn.Module):
    def __init__(self, in_channels, out_channels, bias=True, device=None):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        # The effective weight is the elementwise product A * B, so backprop
        # reaches both factors.
        self.A = nn.Parameter(torch.zeros((self.in_channels, self.out_channels), device=device))
        self.B = nn.Parameter(torch.zeros((self.in_channels, self.out_channels), device=device))
        if bias:
            self.bias = nn.Parameter(torch.zeros((self.out_channels), device=device))
        else:
            self.bias = None
        self.reset_parameters()

    def reset_parameters(self):
        # Fourth root of fan_in, so the product A * B ends up at roughly the
        # usual 1/sqrt(fan_in) scale.
        stdv = 1. / (self.A.size(0)) ** (1 / 4)
        for param in self.parameters():
            nn.init.uniform_(param, -stdv, stdv)
        if self.bias is not None:
            stdv = 1. / (self.A.size(0)) ** (1 / 2)
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, x):
        C = self.A * self.B  # elementwise multiplication
        if self.bias is not None:
            return x @ C + self.bias
        else:
            return x @ C

model = CustomLinearLayer(10, 20)
dummy_inputs = torch.rand(100, 10)
print(model(dummy_inputs).size())
Note that the weight initialization method provided above will give you fairly decent results for what you're trying to do (FYI, improper initialization may cause the model to produce NaNs or show little to no improvement).
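As a quick sanity check that backprop really reaches both factors, here is a minimal sketch reusing model and dummy_inputs from above:

loss = model(dummy_inputs).sum()
loss.backward()
print(model.A.grad is not None, model.B.grad is not None)  # True True
# Since dC/dA = B and dC/dB = A, initializing both factors to exactly zero
# would leave both gradients at zero, which is why the init above matters.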
Thanks for the reply.
This is essentially what I did: I went into the Hugging Face transformers code and wrote a wrapper class around the module to do pretty much what you did. I guess this is the best solution for now. It is burdensome when the model architecture is not your own (i.e., if the architecture comes from a large codebase, the modification takes a while). What I was hoping for was something like:
for param in model.parameters():
    paramA = nn.Parameter(..., requires_grad=False)
    paramB = nn.Parameter()
    param = paramA * paramB
and this would keep the original model architecture but replace each leaf parameter with a product of two new parameters. Specifically, in this application we have a predefined paramA and want to train only paramB.
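A sketch of one possible way to get close to this without touching the architecture, using torch.nn.utils.parametrize (this is not from the thread; it assumes a reasonably recent PyTorch, and MulAB and layer are just placeholder names):

import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class MulAB(nn.Module):
    """Reparameterize a tensor as A * B, with A fixed and B trainable."""
    def __init__(self, A):
        super().__init__()
        self.register_buffer("A", A)  # fixed factor, not trained
    def forward(self, B):
        return self.A * B

layer = nn.Linear(10, 20)
A = layer.weight.detach().clone()  # the predefined paramA
parametrize.register_parametrization(layer, "weight", MulAB(A))
# The original weight tensor now acts as the trainable factor B;
# reset it to ones so A * B reproduces the original weight at the start.
with torch.no_grad():
    layer.parametrizations.weight.original.copy_(torch.ones_like(A))

layer(torch.rand(4, 10)).sum().backward()  # gradients flow into B; A is a buffer

This keeps the module's own forward untouched, because every access to layer.weight transparently computes A * B.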
LoRA might be similar to what you're doing, except that it adds the outer product of two low-rank factors to the weight elementwise, instead of multiplying the weight elementwise by another tensor. You can see here for coding ideas:
Additionally, lucidrains has a PyTorch implementation of PaLM with RLHF that uses LoRA weights, which you might find interesting since it is implemented with transformers:
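For reference, here is a minimal sketch of that LoRA-style additive update, to contrast with the elementwise multiplication above (not code from any of the linked projects; the class name and rank r are placeholders):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen weight W plus a trainable low-rank update B @ A,
    # added to (rather than multiplied with) the original weight.
    def __init__(self, in_features, out_features, r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.zeros(r, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.normal_(self.A, std=0.02)  # B stays zero so the update starts at zero

    def forward(self, x):
        return x @ (self.weight + self.B @ self.A).t()

layer = LoRALinear(10, 20)
print(layer(torch.rand(100, 10)).shape)  # torch.Size([100, 20])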