Model freezes right at loss.backward()

xdwang0726 · August 9, 2020, 8:14am

When I train the model, training freezes at loss.backward() without any error message or warning. It just freezes. I printed the loss; here is one example:
loss tensor(0.6277, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward>)
Has anyone faced the same problem, and are there any suggestions for solving it?

Thank you!!

ptrblck · August 10, 2020, 9:55am

Could you run your code on the CPU and check whether it works or raises an error message?
If it works, could you post an executable code snippet as well as your PyTorch, CUDA, and cuDNN versions?

xdwang0726 · August 10, 2020, 8:58pm

I have tested the code on the CPU, and the same issue appears: training freezes at loss.backward() with no error message or warning.

xdwang0726 · August 11, 2020, 2:47am

I made some changes in the forward and the model runs. Basically, I need the main diagonal of a matrix multiplication, so I customized a col_wise_mul function.

Before:

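def col_wise_mul(m1, m2):
    # Builds the diagonal of m1 @ m2 one column at a time.
    # Start from an empty tensor on the same device as the inputs.
    result = torch.zeros(0, device=m1.device)
    for i in range(m1.shape[1]):
        v1 = m1[:, i, :]  # i-th row of every batch item
        v2 = m2[:, i]     # i-th column of m2
        v = torch.matmul(v1, v2).unsqueeze(1)
        result = torch.cat((result, v), dim=1)  # append as a new column
    return result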

x = col_wise_mul(x_feature, label_feature)

Now:

x = torch.diagonal(torch.matmul(x_feature, label_feature), offset=0).transpose(0, 1)

Why does the custom function not work? Is there another way to accomplish this? I have two very large matrices (400, 30000), and multiplying first and then taking the diagonal costs too much memory.
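One memory-friendly option, sketched here under assumptions, is torch.einsum, which computes only the diagonal entries and never builds the full product. The shapes are assumed from the indexing in col_wise_mul (x_feature as (batch, N, D), label_feature as (D, N)); the thread does not state them. The name diag_of_product and the small sizes are illustrative only. As a guess that the thread does not confirm: the loop version records one matmul and one torch.cat per column, and each cat copies the growing result, so with N = 30000 the backward pass may be extremely slow rather than truly frozen.

```python
import torch

def diag_of_product(m1, m2):
    # Assumed shapes: m1 is (batch, N, D), m2 is (D, N).
    # out[b, i] = sum_d m1[b, i, d] * m2[d, i], i.e. the diagonal of
    # m1[b] @ m2, computed without forming the (N, N) product.
    return torch.einsum('bnd,dn->bn', m1, m2)

torch.manual_seed(0)
m1 = torch.randn(2, 5, 3, requires_grad=True)
m2 = torch.randn(3, 5, requires_grad=True)

out = diag_of_product(m1, m2)

# Reference for comparison: full product, then the diagonal of the
# last two dimensions (only feasible here because N is tiny).
ref = torch.diagonal(torch.matmul(m1, m2), dim1=-2, dim2=-1)
print(out.shape, torch.allclose(out, ref, atol=1e-6))

# Gradients flow through einsum like any other differentiable op.
out.sum().backward()
```

The same result can also be written as (m1 * m2.t()).sum(-1), which broadcasts the transposed (N, D) matrix over the batch dimension; either form keeps the output at (batch, N).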

ptrblck · August 11, 2020, 7:46am

Are you running out of memory using the first approach?
Also, could you add a print statement inside the loop and check whether the code stops at a specific iteration?
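A minimal sketch of that check, assuming the same loop as col_wise_mul: the wrapper name col_wise_mul_verbose, the `every` parameter, and the tiny random inputs are hypothetical and only serve to show where the progress line goes.

```python
import time
import torch

def col_wise_mul_verbose(m1, m2, every=2):
    # Same loop as col_wise_mul, plus a progress line every `every` columns.
    result = torch.zeros(0, device=m1.device)
    start = time.perf_counter()
    for i in range(m1.shape[1]):
        v = torch.matmul(m1[:, i, :], m2[:, i]).unsqueeze(1)
        result = torch.cat((result, v), dim=1)
        if i % every == 0:
            # flush=True so the line shows up even if the process stalls later
            print(f"iter {i}: {time.perf_counter() - start:.4f}s", flush=True)
    return result

# Tiny illustrative inputs; the real tensors are far larger.
m1 = torch.randn(2, 5, 3)
m2 = torch.randn(3, 5)
res = col_wise_mul_verbose(m1, m2)
print(res.shape)
```

If the printed times keep growing per iteration instead of stopping at one index, that points to a slow loop rather than a hang at a specific column.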

xdwang0726 · August 14, 2020, 3:18pm

It does not return any out-of-memory messages when I use the custom function; the model just freezes at the first batch in the first epoch. (I added a print statement after each line to see which line causes the problem.)

ptrblck · August 14, 2020, 11:49pm

Could you post a minimal code snippet that shows this behavior in the latest PyTorch version, please?
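Such a snippet might look roughly like the following. This is not the original poster's model: the random inputs, the sizes, and the variable names x and w are assumptions chosen so the script runs quickly, and it only times the loop-based function's forward and backward passes as the number of columns grows.

```python
import time
import torch

def col_wise_mul(m1, m2):
    # Loop-based diagonal of m1 @ m2, as posted earlier in the thread.
    result = torch.zeros(0, device=m1.device)
    for i in range(m1.shape[1]):
        v = torch.matmul(m1[:, i, :], m2[:, i]).unsqueeze(1)
        result = torch.cat((result, v), dim=1)
    return result

print(torch.__version__)
for n in (100, 200, 400):  # far smaller than the 30000 mentioned above
    x = torch.randn(4, n, 16, requires_grad=True)
    w = torch.randn(16, n, requires_grad=True)
    t0 = time.perf_counter()
    loss = col_wise_mul(x, w).sum()
    t1 = time.perf_counter()
    loss.backward()
    t2 = time.perf_counter()
    print(f"N={n}: forward {t1 - t0:.3f}s, backward {t2 - t1:.3f}s")
```

Raising n step by step shows how the timings scale, which helps tell a very slow backward pass apart from a genuine deadlock.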