Model training script raises RuntimeError after upgrading from PyTorch 1.4.0 to 1.6.0-nightly

I tried to upgrade a model training script from a PyTorch 1.4.0 environment, in which it works, to a PyTorch 1.6.0-nightly environment, in which it does not.

The fact that there are no code changes leads me to believe there’s some broken behavior in 1.6.0-nightly, which is why I’ve filed this issue as a bug report.

Maybe I’m wrong though? In case I am, I’d love for someone here to double-check my layers. I’m getting a RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation, but I’m pretty sure that none of my operations are in-place. Would love to get a second pair of eyes on this. :smile:

Here’s a gist with the model training code: https://gist.github.com/ResidentMario/d3767d5f3944ca95218f83e5ec0f5b44.

Hi,

You can check the release notes for 1.5, in particular the part about fixing the in-place detection for the optimizer.step() functions of the built-in optimizers.

The issue is that optimizer.step() modifies the weights in place. If those weights are needed to compute the backward pass later, you’ll get exactly the error you’re seeing.
I’m not sure why you call optimizer.step() at the beginning of the loop in your code, but you should move it to just after the backward if possible (or at least not between the forward and the backward).
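
Here’s a minimal sketch of that failure mode with a toy model (for illustration only, not your actual script): the forward pass saves the weights it used, optimizer.step() then changes them in place, and the version-counter check fires when the backward runs.

    import torch

    # Toy model and data, just to reproduce the ordering problem.
    model = torch.nn.Sequential(
        torch.nn.Linear(4, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 4), torch.randn(8, 1)

    # A first backward populates .grad so the next step() actually updates the weights.
    torch.nn.functional.mse_loss(model(x), y).backward()

    loss = torch.nn.functional.mse_loss(model(x), y)  # forward: saves the current weights
    optimizer.step()   # in-place update of the weights saved by the forward above
    loss.backward()    # RuntimeError on 1.5+: "... modified by an inplace operation"

Moving optimizer.step() to after loss.backward() avoids this, because nothing that the backward needs has been modified in between.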

Wow, thanks for the quick response!

The reason they were in that order is that I was getting a nag message from OneCycleLR about calling the learning rate scheduler’s step function ahead of the optimizer’s step function. To reverse their order, I copy-pasted the optimizer.step() call up to its current position in the script.

After updating the training loop so that the optimizer and scheduler steps go at the end:

        for epoch in range(self.n_epochs):
            lvs = []
            for i, (X_batch, y_batch) in enumerate(batches):
                X_batch = X_batch.cuda()
                y_batch = y_batch.cuda()

                # Forward pass and loss.
                y_pred = model(X_batch).squeeze()
                loss = self.loss_fn(y_pred, y_batch)

                # Clear stale gradients, then backpropagate.
                optimizer.zero_grad()
                loss.backward()

                lv = loss.detach().cpu().numpy()
                lvs.append(lv)
                if i % 100 == 0:
                    print(f"Epoch {epoch + 1}/{self.n_epochs}; Batch {i}; Loss {lv}")

                # Weight update and LR schedule come after the backward,
                # with optimizer.step() before scheduler.step().
                optimizer.step()
                scheduler.step()
            print(
                f"Epoch {epoch + 1}/{self.n_epochs}; Average Loss {np.mean(lvs)}"
            )

It works. Thanks for the help! Will definitely be more careful about the ordering of these steps going forward.