Hi all,
I have a linear layer, self.X, that I initialize in the model class. Previously, in my forward
function, I was using an expanded view of its weight:
X_nt = self.X.weight.expand(1,-1,-1)
This layer scales quadratically with my input dimensionality, so to scale things up I figured I'd rather operate directly on the original self.X
instead of copying it into a new variable X_nt.
The two variables appear to point to different memory addresses, so it seemed like a worthwhile way to free up memory (though I could be mistaken here).
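For reference, a quick standalone check like this should show whether the expanded tensor actually owns separate memory (the layer size here is made up, not my real model):

    import torch.nn as nn

    X = nn.Linear(8, 8)  # stand-in for self.X
    X_nt = X.weight.expand(1, -1, -1)

    # expand() is documented to return a view without allocating new memory,
    # so I'd expect both tensors to start at the same storage address
    print(X.weight.data_ptr() == X_nt.data_ptr())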
However, when I delete X_nt
and directly substitute in its expanded version, my model suddenly becomes a little over twice as slow with much worse performance. I figured there’s something wrong with the way I’m storing gradients.
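Concretely, the change in forward is along these lines (the batched matmul and h are just stand-ins for my actual computation):

    # before: build the expanded view once and reuse it
    X_nt = self.X.weight.expand(1, -1, -1)
    out = torch.bmm(X_nt, h)

    # after: drop X_nt and write the expanded expression inline wherever it was used
    out = torch.bmm(self.X.weight.expand(1, -1, -1), h)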
I switched optimizer.zero_grad()
to model.zero_grad()
and moved it to the end of my training loop, to no avail. My forward function is a bit unconventional in that it returns a loss directly, since there's a time dependency I need to track per forward pass, but the algorithm worked fine before this change. Here is the training loop I'm using:
    def train_loop(self, dataloader, optimizer, clip_value=-1):
        losses = []
        self.train()
        for batch, data in enumerate(dataloader):
            # optimizer.zero_grad()
            loss = self(data)
            loss.backward()
            # default clip value is -1, so we don't clip unless explicitly specified
            if clip_value > 0:
                torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=clip_value)
            optimizer.step()
            losses.append(loss.item())
            self.zero_grad()
        avg_epoch_loss = np.mean(losses)
        return avg_epoch_loss
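In case it's relevant, here's roughly how I drive the loop each epoch (the optimizer choice and hyperparameters here are placeholders, not my real setup):

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder optimizer / lr
    for epoch in range(num_epochs):
        avg_loss = model.train_loop(dataloader, optimizer, clip_value=1.0)
        print(f"epoch {epoch}: avg loss {avg_loss:.4f}")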
Any help would be greatly appreciated!