loss.backward() keeps running for hours

I am using PyTorch to train a model on some X-ray images, but I ran into the following issue:

At the line loss.backward(), the program just keeps running and never ends, with no error or warning.

            loss, outputs = self.forward(images, targets)
            loss = loss / self.accumulation_steps  # scale loss for gradient accumulation
            print("loss calculated: " + str(loss))

            if phase == "train":
                print("running loss backward!")
                loss.backward()
                print("loss backward finished!")
                # step the optimizer only once every accumulation_steps iterations
                if (itr + 1) % self.accumulation_steps == 0:
                    self.optimizer.step()
                    self.optimizer.zero_grad()

The loss printed just before this is something like tensor(0.8598, grad_fn=&lt;DivBackward0&gt;), so the forward pass itself completes.

Could anyone help me figure out why this keeps running, or suggest good ways to debug the backward() call?

I am using torch 1.2.0+cu92 with what I believe is a compatible CUDA 10.0 install.

Thank you so much!!

Hi,

This is quite unexpected.
We would need more information, though, to reproduce this.
Could you share a small code repro that runs on Colab: https://colab.research.google.com/notebook#create=true ?
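In the meantime, two generic ways to narrow down an apparent hang in backward(): since CUDA kernels launch asynchronously, the call that appears stuck is not always the one at fault, so setting CUDA_LAUNCH_BLOCKING=1 before importing torch forces synchronous launches and pinpoints the real operation; and anomaly detection makes autograd report the forward op responsible for a bad gradient. A minimal sketch (on a toy tensor, not your model):

```python
import os

# Force synchronous CUDA kernel launches so a hang/error surfaces at the
# operation that actually caused it. Must be set before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Anomaly mode adds extra checks during backward and points at the
# forward op that produced a bad gradient. It is slow; debug use only.
with torch.autograd.set_detect_anomaly(True):
    x = torch.randn(4, 3, requires_grad=True)
    loss = (x * 2).sum()
    loss.backward()

# If backward completed, gradients are populated (d(2x)/dx = 2 everywhere).
print(x.grad.shape)
```

If the toy example runs but your training loop still hangs, the problem is likely specific to your model or data pipeline, which is why a small repro would help.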

I upgraded my PyTorch to the latest version (torch 1.4.0, torchvision 0.5.0 with CUDA 10.1), and now it runs smoothly.

Thank you!
