Model freezes right at loss.backward()

When I train the model, training freezes at loss.backward() without any error message or warning. It just hangs. I printed the loss; here is one example:
loss tensor(0.6277, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
Has anyone faced the same problem, and are there any suggestions for solving it?

Thank you!!

Could you run your code on the CPU and check whether it works or raises an error message?
If it works, could you post an executable code snippet as well as your PyTorch, CUDA, and cuDNN versions?

I have tested the code on the CPU, and the same issue appears: training freezes at loss.backward() with no error message or warning.

I made some changes to the forward pass and the model now runs. Basically, I need the main diagonal of a matrix product, so I had written a custom col_wise_mul function.

Before:

def col_wise_mul(m1, m2):
    result = torch.zeros(0).to('cpu')
    for i in range(m1.shape[1]):
        v1 = m1[:, i, :]
        v2 = m2[:, i]
        v = torch.matmul(v1, v2).unsqueeze(1)
        result = torch.cat((result, v), dim=1)
    return result

x = col_wise_mul(x_feature, label_feature)

Now:

 x = torch.diagonal(torch.matmul(x_feature, label_feature), offset=0).transpose(0, 1)

I am wondering why the custom function does not work. Are there any other ways to accomplish this? I have two very large matrices (400, 30000), and if I do the multiplication first and then take the diagonal, it costs too much memory.

Are you running out of memory using the first approach?
Also, could you add a print statement inside the loop and check whether the code stops at a specific iteration?
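
If memory is the concern, one option that might help is to compute only the needed entries with torch.einsum, instead of building the full product and taking its diagonal. Below is a minimal sketch, assuming x_feature has shape (batch, n, d) and label_feature has shape (d, n), which is what the indexing in col_wise_mul suggests. The names col_wise_mul_loop and col_wise_mul_einsum and the small sizes are made up for illustration.

```python
import torch

def col_wise_mul_loop(m1, m2):
    # Reference version: one (batch,) vector per column i, stacked at the end
    cols = [torch.matmul(m1[:, i, :], m2[:, i]) for i in range(m1.shape[1])]
    return torch.stack(cols, dim=1)

def col_wise_mul_einsum(m1, m2):
    # out[b, i] = sum_k m1[b, i, k] * m2[k, i]
    # No Python loop and no (batch, n, n) intermediate product
    return torch.einsum('bik,ki->bi', m1, m2)

torch.manual_seed(0)
# Assumed shapes: (batch, n, d) and (d, n); kept tiny on purpose
x_feature = torch.randn(4, 6, 5, requires_grad=True)
label_feature = torch.randn(5, 6, requires_grad=True)

out_loop = col_wise_mul_loop(x_feature, label_feature)
out_einsum = col_wise_mul_einsum(x_feature, label_feature)

print(out_einsum.shape)  # torch.Size([4, 6])
print(torch.allclose(out_loop, out_einsum, atol=1e-5))

# Gradients flow through the einsum version as usual
out_einsum.sum().backward()
```

If your real tensors have a different layout, the subscript string would need to be adjusted to match.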

It does not report any out-of-memory errors when I use the custom function; the model just freezes on the first batch of the first epoch. (I added a print statement after each line to see which line causes the problem.)

Could you post a minimal code snippet, which would show this behavior in the latest PyTorch version, please?
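
A starting point for such a snippet could look like the sketch below. It reuses the loop from the post at a much smaller size and times the forward and backward passes separately. This may help tell a true hang apart from a backward pass that is just very slow, since the repeated torch.cat copies the growing result on every iteration and adds one node to the autograd graph each time. The sizes, the random inputs, and the sum() used as a stand-in for the loss are assumptions, not the original model.

```python
import time
import torch

def col_wise_mul(m1, m2):
    # Loop as in the post; the empty start tensor is given shape (batch, 0)
    # here so that the concatenation along dim=1 is explicit
    result = torch.zeros(m1.shape[0], 0)
    for i in range(m1.shape[1]):
        v = torch.matmul(m1[:, i, :], m2[:, i]).unsqueeze(1)
        result = torch.cat((result, v), dim=1)  # copies `result` each time
    return result

# Grow n step by step and watch how the timings scale (assumed sizes)
for n in (200, 400, 800):
    x_feature = torch.randn(8, n, 16, requires_grad=True)
    label_feature = torch.randn(16, n, requires_grad=True)

    t0 = time.perf_counter()
    loss = col_wise_mul(x_feature, label_feature).sum()  # stand-in loss
    t1 = time.perf_counter()
    loss.backward()
    t2 = time.perf_counter()

    print(f"n={n}: forward {t1 - t0:.3f}s, backward {t2 - t1:.3f}s")
```

If the backward time grows much faster than n, the full-size run is likely slow rather than stuck.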