CPU sleeping after one iteration of loss.backward()

Hi

I can run the code on one machine, but it doesn't work on another machine with the same PyTorch version, 0.4.1.

All the CPUs go into sleep mode after one iteration of loss.backward(); there are no errors, and the program just gets stuck.

The code looks like this:

for i,batchdata in enumerate(train_loader):
    optimizer.zero_grad()
    x,y=batchdata
    y_pred=model(x)
    print(x)
    print(len(x))
    print(y_pred)
    loss_train=criterion(y_pred,y)
    print(loss_train)
    loss_train.backward(retain_graph=True)
    print('done batch',i)
    optimizer.step()
    print("Finished Step")
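
If it helps narrow things down, one thing I could try is dumping the Python stacks while the program is stuck, using the standard-library faulthandler module. This is just a sketch; I'm not sure it will show anything useful if the hang is inside native code:

import faulthandler
import sys

# periodically dump the stack of every thread to stderr,
# so there is a traceback to inspect once the program hangs
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)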

Any idea about what might be happening here? Many thanks in advance.

Your code looks ok.
Could you provide a minimal (<100 lines of code) example that reproduces this so that we can look into it in more detail?

Hi

Thank you for your quick reply.

For more info:

I'm using a DNN with 2 hidden layers and cross entropy as the loss function; the batch size is 32.

criterion=nn.CrossEntropyLoss()

The output of the code in my post looks like this:

tensor([[0., 0., 0., …, 0., 0., 1.],
[1., 0., 0., …, 0., 1., 1.],
[0., 0., 0., …, 0., 1., 1.],
…,
[1., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 1.],
[1., 0., 0., …, 0., 1., 1.]])
32
tensor([[ 0.6209, -0.8086, 0.1675, …, -0.2908, -1.3713, -1.6336],
[ 0.5608, -0.6456, 0.2424, …, -0.4206, -1.1513, -1.7551],
[ 0.5057, -0.9428, 0.2893, …, -0.5379, -1.2654, -1.5686],
…,
[ 0.6501, -0.6999, 0.3204, …, -0.3298, -1.3569, -1.6235],
[ 0.8948, -0.9117, 0.1544, …, -0.3279, -1.1618, -1.6269],
[ 0.7063, -0.7219, 0.2326, …, -0.3535, -1.2034, -1.7179]],
grad_fn=<ThAddmmBackward>)
tensor(5.2543, grad_fn=<NllLossBackward>)
done batch 0

The strange thing is that when I change the batch size to 1, loss_train.backward() doesn't work at all.

The output looks like this if the batch size is 1:

begin batch 0
tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0.]])
1
tensor([[-0.7247, 1.0524, -2.2573, -1.1343, -0.4040, -0.3480, 1.0350, 1.2908,
1.9561, 0.6749, 0.2105, -1.3460, 1.5483, 1.3202, -1.6915, -2.9539,
0.4566, -0.0062, -0.7644, 1.4134, -2.7010, 0.1742, 3.0185, -0.8109,
-0.1226, -0.1662, 1.5913, -1.1074, 1.0465, -0.3631, 0.7734, -1.1649,
-0.7790, -0.4970, -3.5045, 1.2011, -1.0401, -1.7327, 0.3079, 0.1145,
3.1292, -0.2242, -1.2624, 0.1882, 0.7860]],
grad_fn=<ThAddmmBackward>)
tensor(1.6626, grad_fn=<NllLossBackward>)

Hi,

This sounds very weird.
I'm afraid we'll need an example that reproduces this to investigate it further, though, as I have no idea where it could come from.

Hi

Please see below for an example:

from numpy import binary_repr
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as utils
import torch.nn.init as init

class ThreeLayerNet(torch.nn.Module):

    def __init__(self,D_in,H,D_out):
        super(ThreeLayerNet,self).__init__()
        self.linear1=torch.nn.Linear(D_in,H)
        self.linear2=torch.nn.Linear(H,H)
        self.linear3=torch.nn.Linear(H,D_out)

    def forward(self,x):
        h_relu_1=self.linear1(x).clamp(min=0)
        h_relu_2=self.linear2(h_relu_1).clamp(min=0)
        y_pred=self.linear3(h_relu_2)
        return y_pred

def weight_init(m):
    if isinstance(m,nn.Linear):
        init.xavier_normal_(m.weight.data)
        init.normal_(m.bias.data)

if __name__ == '__main__':

    device=torch.device("cpu")
    print('a')
    D_in,D_out=76,45
    H=128
    model=ThreeLayerNet(D_in,H,D_out)
    print('b')
    model.apply(weight_init)
    print('c')
    criterion=nn.CrossEntropyLoss()
    optimizer=torch.optim.Adam(model.parameters(),lr=1e-3)

    x=torch.tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
    y=torch.tensor([2])
    print('d')
    y_pred=model(x)
    print('e')
    loss_train=criterion(y_pred,y)
    print('start bp')
    loss_train.backward(retain_graph=True)
    print('done bp')
    optimizer.step()
    print('done step')

Using this code, I can only get it to print up to 'd', and then it gets stuck.

Hi,

I can actually run this code locally with no issues on the latest PyTorch, all the way to "done step".
Could you try version 1.0, as it might be a bug that has already been resolved?
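
For reference, something like this should show which build each machine is actually running. The set_num_threads call is just a guess on my side: if upgrading doesn't help, limiting the CPU thread count can help rule out a threading issue in the CPU backend.

import torch

# confirm which PyTorch build this machine is running
print(torch.__version__)

# just a guess: limit CPU threads to rule out a threading issue
torch.set_num_threads(1)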