2x slowdown when directly operating on parameter tensor in forward function

Hi all,

I have a linear layer, self.X, that I initialize in the model class.

Previously, in my forward function, I was using an expanded version of this layer’s weight:

    X_nt = self.X.weight.expand(1, -1, -1)

This layer scales quadratically with my input dimensionality, so to scale things up I figured I’d rather operate directly on the original self.X instead of copying it into a new variable X_nt.

The two variables point to different memory addresses, so freeing up that memory seemed like a worthwhile goal (though I could be mistaken here).
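For reference, here’s a minimal check of the memory situation, using a toy layer as a stand-in for self.X (the sizes are made up):

    import torch
    import torch.nn as nn

    # Toy stand-in for self.X; the sizes are made up for illustration.
    X = nn.Linear(4, 4)
    X_nt = X.weight.expand(1, -1, -1)

    # The two names refer to different Python objects...
    print(X_nt is X.weight)
    # ...but data_ptr() shows whether the underlying storage is shared.
    # If expand() returns a view rather than a copy, these should match.
    print(X_nt.data_ptr() == X.weight.data_ptr())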

However, when I delete X_nt and substitute the expanded expression directly wherever it was used, my model suddenly becomes a little over twice as slow and performs much worse. I figured there’s something wrong with the way I’m storing gradients.

I swapped optimizer.zero_grad() for model.zero_grad() and moved it to the end of my training loop, to no avail. My forward function is a bit unconventional in that it returns a loss, since there’s a time dependency I need to track per forward pass, but the algorithm worked fine before this change. Here is the training loop I’m using:

    def train_loop(self, dataloader, optimizer, clip_value=-1):
        losses = []
        self.train()
        for batch, data in enumerate(dataloader):
            # optimizer.zero_grad()
            loss = self(data)
            loss.backward()
            # default clip value is -1, so we don't clip unless explicitly specified
            if clip_value > 0:
                torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=clip_value)
            optimizer.step()
            losses.append(loss.item())
            # gradients are now zeroed at the end of the iteration instead of the start
            self.zero_grad()

        avg_epoch_loss = np.mean(losses)

        return avg_epoch_loss

Any help would be greatly appreciated!

I don’t know how you’ve profiled your code, but if you are using the GPU, note that CUDA operations are executed asynchronously, so you would need to synchronize your code before starting and stopping the host timers. Otherwise your profile might be misleading. If that’s not an issue, check whether the same kernels are called in both versions. If not, try to narrow down which operation or layer changes.
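Something like this would give you a usable timing and a quick look at the launched kernels (just a rough sketch; model and data are placeholders for your own module and batch):

    import time
    import torch

    # Synchronize before starting and stopping the host timer so that
    # asynchronous CUDA kernels are fully included in the measurement.
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    loss = model(data)  # placeholder for your forward pass returning the loss
    loss.backward()

    torch.cuda.synchronize()
    t1 = time.perf_counter()
    print(f"elapsed: {(t1 - t0) * 1e3:.3f} ms")

    # The profiler shows which kernels were launched, which makes it easier
    # to compare the two versions of the forward pass.
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
    ) as prof:
        loss = model(data)
        loss.backward()
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))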


Thanks a ton for your reply! I will look into this and update my post later in case anyone else has the same issue.