Will it be OK to call loss.backward() within the forward() method?

For example, if we design forward() like this:

def forward(self, x, y):
    y_pred = self.w * x
    loss = mse_loss(y_pred, y)
    loss.backward()  # backward is called inside forward()
    return y_pred

and, in the training loop, do:

for epoch in range(10):
    for x, y in data_loader:
        optimizer.zero_grad()
        y_pred = model(x, y)  # no explicit loss.backward() here; it already ran inside forward()
        optimizer.step()


Will this technically work? Likely.

Is it a good idea? Maybe not if other people are surprised by it.

Best regards

Thomas

In addition to the already mentioned surprised users, I would also double check whether some utilities, such as DDP, would still work, as these often depend on a “standard” training loop. E.g., in this case I wouldn’t know whether DDP would synchronize the gradients or not.
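
For comparison, the pattern the DDP examples usually assume looks roughly like this (a minimal sketch; the process-group setup is omitted, and model, data_loader, and optimizer are the ones from above, but with forward() taking only x and returning predictions):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already been called
ddp_model = DDP(model)          # `model` as above, but forward() returns only y_pred
loss_fn = nn.MSELoss()

for epoch in range(10):
    for x, y in data_loader:
        optimizer.zero_grad()
        y_pred = ddp_model(x)   # forward() just computes predictions
        loss = loss_fn(y_pred, y)
        loss.backward()         # DDP all-reduces the gradients during this call
        optimizer.step()

In that arrangement the gradient synchronization happens during the loss.backward() call in the loop, which is why moving backward() into forward() makes me unsure what DDP would do.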

Yep, it looks monstrous and maybe idiotic, and I was wondering whether DDP would synchronize as well.

But consider a case like this:

def forward(self, x, y):
    x = self.in_layer(x)
    loss = 0
    for _ in range(self.n):
        x = self.net(x)             # apply the same block repeatedly
        loss += self.loss_fn(x, y)  # accumulate a loss at every step
    return x, loss

with loss.backward() called outside of forward().

I wonder whether there is another way to accumulate the gradients of those losses, other than maintaining one large computation graph, in order to reduce memory consumption.
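
One option I have been considering is activation checkpointing via torch.utils.checkpoint, which would recompute self.net's activations during backward instead of storing them for all n steps. Something like this sketch (same module attributes as above):

from torch.utils.checkpoint import checkpoint

def forward(self, x, y):
    x = self.in_layer(x)
    loss = 0
    for _ in range(self.n):
        # don't store self.net's intermediate activations;
        # recompute them during the backward pass instead
        x = checkpoint(self.net, x, use_reentrant=False)
        loss += self.loss_fn(x, y)
    return x, loss

If I understand checkpointing correctly, that keeps the gradients exact and trades the saved memory for an extra forward pass through self.net during backward.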

If the answer is simply that it’s better to implement a custom function so that backward() isn’t called inside forward(), or even to avoid .backward() entirely, use autograd directly, and synchronize the gradients yourself, purely for readability or “aesthetics”, then fine, I’ll take it.
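
For the “use autograd and synchronize yourself” route, I imagine something roughly like this (using the model/loss/optimizer names from above; the all-reduce part would only matter in a distributed run):

import torch
import torch.distributed as dist

# compute gradients explicitly instead of calling loss.backward()
params = [p for p in model.parameters() if p.requires_grad]
grads = torch.autograd.grad(loss, params)

with torch.no_grad():
    for p, g in zip(params, grads):
        if dist.is_initialized():
            # manual gradient synchronization across ranks
            dist.all_reduce(g, op=dist.ReduceOp.SUM)
            g /= dist.get_world_size()
        # accumulate into .grad so optimizer.step() works as usual
        p.grad = g if p.grad is None else p.grad + g

optimizer.step()

Though if the only real objection is readability, simply returning the loss from forward() and calling backward() in the training loop is probably the least surprising option.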